Science.gov

Sample records for parallel programming ppopp

  1. Introduction to parallel programming

    SciTech Connect

    Brawer, S.

    1989-01-01

    This book describes parallel programming and all the basic concepts illustrated by examples in a simplified FORTRAN. Concepts covered include: The parallel programming model; The creation of multiple processes; Memory sharing; Scheduling; Data dependencies. In addition, a number of parallelized applications are presented, including a discrete-time, discrete-event simulator, numerical integration, Gaussian elimination, and parallelized versions of the traveling salesman problem and the exploration of a maze.
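
    The book's examples are in a simplified FORTRAN; as a loose C analogue, the sketch below illustrates three of the concepts listed -- creating multiple threads of execution, sharing memory, and honoring a data dependency -- on one of the book's applications, numerical integration.

```c
/* Illustrative only: the book uses a simplified FORTRAN. This C/pthreads
   sketch shows process creation, memory sharing, and a data dependency. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static double partial[NTHREADS];   /* shared memory: one slot per worker */

static void *worker(void *arg) {
    int id = *(int *)arg;
    double sum = 0.0;
    /* each worker integrates f(x) = x*x over its own strided share of [0,1] */
    for (int i = id; i < 1000; i += NTHREADS) {
        double x = (i + 0.5) / 1000.0;
        sum += x * x / 1000.0;
    }
    partial[id] = sum;             /* no conflict: workers use disjoint slots */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    double total = 0.0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL); /* the join is the data dependency:    */
        total += partial[i];        /* partial[i] is read only afterwards  */
    }
    printf("integral of x^2 on [0,1] ~= %f\n", total);
    return 0;
}
```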

  2. Parallel programming with PCN

    SciTech Connect

    Foster, I.; Tuecke, S.

    1991-12-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A).

  3. Parallel programming with PCN

    SciTech Connect

    Foster, I.; Tuecke, S.

    1991-09-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, a set of tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory at info.mcs.anl.gov.

  4. Bilingual parallel programming

    SciTech Connect

    Foster, I.; Overbeek, R.

    1990-01-01

    Numerous experiments have demonstrated that computationally intensive algorithms support adequate parallelism to exploit the potential of large parallel machines. Yet successful parallel implementations of serious applications are rare. The limiting factor is clearly programming technology. None of the approaches to parallel programming that have been proposed to date -- whether parallelizing compilers, language extensions, or new concurrent languages -- seem to adequately address the central problems of portability, expressiveness, efficiency, and compatibility with existing software. In this paper, we advocate an alternative approach to parallel programming based on what we call bilingual programming. We present evidence that this approach provides an effective solution to parallel programming problems. The key idea in bilingual programming is to construct the upper levels of applications in a high-level language while coding selected low-level components in low-level languages. This approach permits the advantages of a high-level notation (expressiveness, elegance, conciseness) to be obtained without the cost in performance normally associated with high-level approaches. In addition, it provides a natural framework for reusing existing code.

  5. Parallel programming with PCN

    SciTech Connect

    Foster, I.; Tuecke, S.

    1993-01-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A). This version of this document describes PCN version 2.0, a major revision of the PCN programming system. It supersedes earlier versions of this report.

  6. Parallel programming with Ada

    SciTech Connect

    Kok, J.

    1988-01-01

    To the human programmer, the ease of coding distributed computing depends strongly on the suitability of the employed programming language. But with a particular language it is also important whether the capabilities of one or more parallel architectures can be addressed efficiently by the available language constructs. This paper discusses the possibilities of the high-level language Ada, and in particular of its tasking concept, as a descriptive tool for the design and implementation of numerical and other algorithms that allow execution of parts in parallel. Language tools are explained and their use for common applications is shown. Conclusions are drawn about the usefulness of several Ada concepts.

  7. Tolerant (parallel) Programming

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Bailey, David H. (Technical Monitor)

    1997-01-01

    In order to be truly portable, a program must be tolerant of a wide range of development and execution environments, and a parallel program is just one which must be tolerant of a very wide range. This paper first defines the term "tolerant programming", then describes many layers of tools to accomplish it. The primary focus is on F-Nets, a formal model for expressing computation as a folded partial-ordering of operations, thereby providing an architecture-independent expression of tolerant parallel algorithms. For implementing F-Nets, Cooperative Data Sharing (CDS) is a subroutine package for implementing communication efficiently in a large number of environments (e.g. shared memory and message passing). Software Cabling (SC), a very-high-level graphical programming language for building large F-Nets, possesses many of the features normally expected from today's computer languages (e.g. data abstraction, array operations). Finally, L2³ is a CASE tool which facilitates the construction, compilation, execution, and debugging of SC programs.

  8. Parallel processor engine model program

    NASA Technical Reports Server (NTRS)

    Mclaughlin, P.

    1984-01-01

    The Parallel Processor Engine Model Program is a generalized engineering tool intended to aid in the design of parallel processing real-time simulations of turbofan engines. It is written in the FORTRAN programming language and executes as a subset of the SOAPP simulation system. Input/output and execution control are provided by SOAPP; however, the analysis, emulation and simulation functions are completely self-contained. A framework in which a wide variety of parallel processing architectures could be evaluated and tools with which the parallel implementation of a real-time simulation technique could be assessed are provided.

  9. Information hiding in parallel programs

    SciTech Connect

    Foster, I.

    1992-01-30

    A fundamental principle in program design is to isolate difficult or changeable design decisions. Application of this principle to parallel programs requires identification of decisions that are difficult or subject to change, and the development of techniques for hiding these decisions. We experiment with three complex applications, and identify mapping, communication, and scheduling as areas in which decisions are particularly problematic. We develop computational abstractions that hide such decisions, and show that these abstractions can be used to develop elegant solutions to programming problems. In particular, they allow us to encode common structures, such as transforms, reductions, and meshes, as software cells and templates that can be reused in different applications. An important characteristic of these structures is that they do not incorporate mapping, communication, or scheduling decisions: these aspects of the design are specified separately, when composing existing structures to form applications. This separation of concerns allows the same cells and templates to be reused in different contexts.

  10. Global Arrays Parallel Programming Toolkit

    SciTech Connect

    Nieplocha, Jaroslaw; Krishnan, Manoj Kumar; Palmer, Bruce J.; Tipparaju, Vinod; Harrison, Robert J.; Chavarría-Miranda, Daniel

    2011-01-01

    The two predominant classes of programming models for parallel computing are distributed memory and shared memory. Both shared memory and distributed memory models have advantages and shortcomings. The shared memory model is much easier to use, but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers, this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine-grain load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message passing. However, this performance comes at the cost of compromising the ease of use that the shared memory model advertises. Distributed memory models, such as message passing or one-sided communication, offer performance and scalability but they are difficult to program. The Global Arrays toolkit attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed by the programmer. This management is achieved by calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be specified by the programmer and hence managed. GA is related to the global address space languages such as UPC, Titanium, and, to a lesser extent, Co-Array Fortran. In addition, by providing a set of data-parallel operations, GA is also related to data-parallel languages such as HPF, ZPL, and Data Parallel C. However, the Global Array programming model is implemented as a library that works with most languages used for technical computing and does not rely on compiler technology for achieving parallel efficiency.
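
    A sketch of this programming model using GA's documented C interface (untested here, so treat the exact calls as indicative rather than authoritative):

```c
/* Sketch of the Global Arrays usage pattern described above: create a
   distributed array, query the locally owned patch, and move data between
   global and local storage by explicit calls. */
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();

    int dims[2] = {1000, 1000};
    /* a 2-D double array distributed across the memories of all processes */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);
    GA_Zero(g_a);

    /* locality is explicit: ask which patch this process owns ...        */
    int me = GA_Nodeid(), lo[2], hi[2];
    NGA_Distribution(g_a, me, lo, hi);

    /* ... and transfer data between global and local storage explicitly  */
    double buf = 3.14;
    int ld[1] = {1};
    NGA_Put(g_a, lo, lo, &buf, ld);   /* write one element of the patch   */
    GA_Sync();                        /* complete outstanding operations  */

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```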

  11. A parallel programming environment supporting multiple data-parallel modules

    SciTech Connect

    Seevers, B.K.; Quinn, M.J.; Hatcher, P.J.

    1992-10-01

    We describe a system that allows programmers to take advantage of both control and data parallelism through multiple intercommunicating data-parallel modules. This programming environment extends C-type stream I/O to include intermodule communication channels. The programmer writes each module as a separate data-parallel program, then develops a channel linker specification describing how to connect the modules together. A channel linker we have developed loads the separate modules on the parallel machine and binds the communication channels together as specified. We present performance data demonstrating that a mixed control- and data-parallel solution can yield better performance than a strictly data-parallel solution. The system described currently runs on the Intel iWarp multicomputer.

  12. The ParaScope parallel programming environment

    NASA Technical Reports Server (NTRS)

    Cooper, Keith D.; Hall, Mary W.; Hood, Robert T.; Kennedy, Ken; Mckinley, Kathryn S.; Mellor-Crummey, John M.; Torczon, Linda; Warren, Scott K.

    1993-01-01

    The ParaScope parallel programming environment, developed to support scientific programming of shared-memory multiprocessors, includes a collection of tools that use global program analysis to help users develop and debug parallel programs. This paper focuses on ParaScope's compilation system, its parallel program editor, and its parallel debugging system. The compilation system extends the traditional single-procedure compiler by providing a mechanism for managing the compilation of complete programs. Thus, ParaScope can support both traditional single-procedure optimization and optimization across procedure boundaries. The ParaScope editor brings both compiler analysis and user expertise to bear on program parallelization. It assists the knowledgeable user by displaying and managing analysis and by providing a variety of interactive program transformations that are effective in exposing parallelism. The debugging system detects and reports timing-dependent errors, called data races, in execution of parallel programs. The system combines static analysis, program instrumentation, and run-time reporting to provide a mechanical system for isolating errors in parallel program executions. Finally, we describe a new project to extend ParaScope to support programming in FORTRAN D, a machine-independent parallel programming language intended for use with both distributed-memory and shared-memory parallel computers.

  13. Parallel Programming in the Age of Ubiquitous Parallelism

    NASA Astrophysics Data System (ADS)

    Pingali, Keshav

    2014-04-01

    Multicore and manycore processors are now ubiquitous, but parallel programming remains as difficult as it was 30-40 years ago. During this time, our community has explored many promising approaches, including functional and dataflow languages, logic programming, and automatic parallelization using program analysis and restructuring, but none of these approaches has succeeded except in a few niche application areas. In this talk, I will argue that these problems arise largely from the computation-centric foundations and abstractions that we currently use to think about parallelism. In their place, I will propose a novel data-centric foundation for parallel programming called the operator formulation, in which algorithms are described in terms of actions on data. The operator formulation shows that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous even in complex, irregular graph applications such as mesh generation/refinement/partitioning and SAT solvers. Regular algorithms emerge as a special case of irregular ones, and many application-specific optimization techniques can be generalized to a broader context. The operator formulation also leads to a structural analysis of algorithms called TAO-analysis that provides implementation guidelines for exploiting parallelism efficiently. Finally, I will describe a system called Galois, based on these ideas, for exploiting amorphous data-parallelism on multicores and GPUs.

  14. Parallelized direct execution simulation of message-passing parallel programs

    NASA Technical Reports Server (NTRS)

    Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

    1994-01-01

    As massively parallel computers proliferate, there is growing interest in finding ways by which the performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, the Large Application Parallel Simulation Environment (LAPSE), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

  15. Parallel programming in Split-C

    SciTech Connect

    Culler, D.E.; Dusseau, A.; Goldstein, S.C.; Krishnamurthy, A.; Lumetta, S.; Eicken, T. von; Yelick, K.

    1993-12-31

    The authors introduce the Split-C language, a parallel extension of C intended for high performance programming on distributed memory multiprocessors, and demonstrate the use of the language in optimizing parallel programs. Split-C provides a global address space with a clear concept of locality and unusual assignment operators. These are used as tools to reduce the frequency and cost of remote access. The language allows a mixture of shared memory, message passing, and data parallel programming styles while providing efficient access to the underlying machine. They demonstrate the basic language concepts using regular and irregular parallel programs and give performance results for various stages of program optimization.

  16. Genetic Parallel Programming: design and implementation.

    PubMed

    Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong

    2006-01-01

    This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.

  17. Shared-memory parallel programming in C++

    SciTech Connect

    Beck, B.

    1990-07-01

    This paper discusses how researchers have produced a set of portable parallel-programming constructs for C, implemented in M4 macros. These parallel-programming macros are available under the name Parmacs. The Parmacs macros let one write parallel C programs for shared-memory, distributed-memory, and mixed-memory (shared and distributed) systems. They have been implemented on several machines. Although Parmacs offers useful parallel-programming features, it is not without problems, and the author has considered how these problems might be overcome or avoided. The author argues that using C++, rather than C, addresses these problems adequately, and describes the C++ features exploited. The work described addresses shared-memory constructs.

  18. IOPA: I/O-aware parallelism adaption for parallel programs

    PubMed Central

    Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei

    2017-01-01

    With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads. PMID:28278236
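
    IOPA's actual programming interface is not reproduced here; the sketch below, with invented names and a synthetic throughput curve, only illustrates the feedback idea of growing the I/O thread count until measured bandwidth stops improving.

```c
/* Hypothetical sketch of I/O-aware parallelism adaption by feedback.
   All names are illustrative, not IOPA's API; a synthetic curve stands
   in for real throughput measurement. */
#include <stdio.h>

/* stand-in for measurement: throughput rises with threads, then the
   device saturates and contention degrades it */
static double measure_io_throughput(int nthreads) {
    double demand = 60.0 * nthreads;              /* MB/s the threads push */
    if (demand < 400.0) return demand;            /* below device limit    */
    return 400.0 / (1.0 + 0.05 * (nthreads - 6)); /* contention past it    */
}

int main(void) {
    int threads = 1;
    double best = 0.0;
    while (threads < 16) {
        double rate = measure_io_throughput(threads);
        printf("threads=%2d rate=%6.1f MB/s\n", threads, rate);
        if (rate <= best) {        /* adding threads stopped helping: */
            threads--;             /* back off one step and stop      */
            break;
        }
        best = rate;
        threads++;                 /* bandwidth still scaling: grow   */
    }
    printf("settled on %d I/O threads\n", threads);
    return 0;
}
```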

  19. Language constructs for modular parallel programs

    SciTech Connect

    Foster, I.

    1996-03-01

    We describe programming language constructs that facilitate the application of modular design techniques in parallel programming. These constructs allow us to isolate resource management and processor scheduling decisions from the specification of individual modules, which can themselves encapsulate design decisions concerned with concurrency, communication, process mapping, and data distribution. This approach permits development of libraries of reusable parallel program components and the reuse of these components in different contexts. In particular, alternative mapping strategies can be explored without modifying other aspects of program logic. We describe how these constructs are incorporated in two practical parallel programming languages, PCN and Fortran M. Compilers have been developed for both languages, allowing experimentation in substantial applications.

  20. Parallel programming with PCN. Revision 1

    SciTech Connect

    Foster, I.; Tuecke, S.

    1991-12-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A).

  1. Parallel programming of industrial applications

    SciTech Connect

    Heroux, M; Koniges, A; Simon, H

    1998-07-21

    In the introductory material, we overview the typical MPP environment for real application computing and the special tools available, such as parallel debuggers and performance analyzers. Next, we draw from a series of real applications codes and discuss the specific challenges and problems that are encountered in parallelizing these individual applications. The application areas drawn from include biomedical sciences, materials processing and design, plasma and fluid dynamics, and others. We show how it was possible to get a particular application to run efficiently and what steps were necessary. Finally, we end with a summary of the lessons learned from these applications and predictions for the future of industrial parallel computing. This tutorial is based on material from a forthcoming book entitled "Industrial Strength Parallel Computing," to be published by Morgan Kaufmann Publishers (ISBN 1-55860-54).

  2. Towards Distributed Memory Parallel Program Analysis

    SciTech Connect

    Quinlan, D; Barany, G; Panas, T

    2008-06-17

    This paper presents a parallel attribute evaluation for distributed memory parallel computer architectures, where previously only shared memory parallel support for this technique had been developed. Attribute evaluation is a part of how attribute grammars are used for program analysis within modern compilers. Within this work, we have extended ROSE, an open compiler infrastructure, with a distributed memory parallel attribute evaluation mechanism to support user-defined global program analysis required for some forms of security analysis, which cannot be addressed by a file-by-file view of large-scale applications. As a result, user-defined security analyses may now run in parallel without the user having to specify the way data is communicated between processors. The automation of communication enables an extensible open-source parallel program analysis infrastructure.

  3. A survey of parallel programming tools

    NASA Technical Reports Server (NTRS)

    Cheng, Doreen Y.

    1991-01-01

    This survey examines 39 parallel programming tools. Focus is placed on those tool capabilities needed for parallel scientific programming rather than for general computer science. The tools are classified with current and future needs of the Numerical Aerodynamic Simulator (NAS) in mind: existing and anticipated NAS supercomputers and workstations; operating systems; programming languages; and applications. They are divided into four categories: suggested acquisitions; tools already brought in; tools worth tracking; and tools eliminated from further consideration at this time.

  4. Hybrid parallel programming with MPI and Unified Parallel C.

    SciTech Connect

    Dinan, J.; Balaji, P.; Lusk, E.; Sadayappan, P.; Thakur, R.; Mathematics and Computer Science; The Ohio State Univ.

    2010-01-01

    The Message Passing Interface (MPI) is one of the most widely used programming models for parallel computing. However, the amount of memory available to an MPI process is limited by the amount of local memory within a compute node. Partitioned Global Address Space (PGAS) models such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a shared global address space that spans the memories of multiple compute nodes. However, taking advantage of UPC can require a large recoding effort for existing parallel applications. In this paper, we explore a new hybrid parallel programming model that combines MPI and UPC. This model allows MPI programmers incremental access to a greater amount of memory, enabling memory-constrained MPI codes to process larger data sets. In addition, the hybrid model offers UPC programmers an opportunity to create static UPC groups that are connected over MPI. As we demonstrate, the use of such groups can significantly improve the scalability of locality-constrained UPC codes. This paper presents a detailed description of the hybrid model and demonstrates its effectiveness in two applications: a random access benchmark and the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance; using hybrid UPC groups that span two cluster nodes, RA performance increases by a factor of 1.33 and using groups that span four cluster nodes, Barnes-Hut experiences a twofold speedup at the expense of a 2% increase in code size.
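
    As a sketch of the UPC half of this hybrid model, the fragment below uses only standard UPC constructs (a shared array spanning the threads' memories and an affinity-driven loop); the paper's static UPC groups connected over MPI require its hybrid runtime and are not shown.

```c
/* Standard UPC sketch of the PGAS model: one logically shared array whose
   elements are distributed round-robin over all UPC threads. */
#include <upc.h>
#include <stdio.h>

#define N 1024

shared double A[N];    /* global address space spanning all threads */

int main(void) {
    int i;
    /* each thread updates exactly the elements with affinity to it */
    upc_forall(i = 0; i < N; i++; &A[i])
        A[i] = 2.0 * i;

    upc_barrier;       /* make all updates visible before any reads */

    if (MYTHREAD == 0)
        printf("A[N-1] = %f (%d threads)\n", (double)A[N - 1], THREADS);
    return 0;
}
```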

  5. Parallel processor programs in the Federal Government

    NASA Technical Reports Server (NTRS)

    Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

    1985-01-01

    In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

  6. Graphics-Based Parallel Programming Tools

    DTIC Science & Technology

    1992-01-01

    ...the Voyeur system [12]. Voyeur is a more conventional tool for displaying application-specific visualizations of parallel programs [13], and it provides... COINS Department, University of Massachusetts (June 1991). ... David Socha, Mary L. Bailey, and David Notkin, "Voyeur: Graphical Views of Parallel Programs"... (September 1991). Nandakumar Varadaraju, "Interfacing Belvedere with Voyeur," Master's Thesis, COINS Department, University of Massachusetts.

  7. Integrated Task and Data Parallel Programming

    NASA Technical Reports Server (NTRS)

    Grimshaw, A. S.

    1998-01-01

    This research investigates the combination of task and data parallel language constructs within a single programming language. There are a number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments: In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda: Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities: During the fall I collaborated

  8. Genetic algorithms using SISAL parallel programming language

    SciTech Connect

    Tejada, S.

    1994-05-06

    Genetic algorithms are a mathematical optimization technique developed by John Holland at the University of Michigan [1]. The SISAL programming language possesses many of the characteristics desired to implement genetic algorithms. SISAL is a deterministic, functional programming language which is inherently parallel. Because SISAL is functional and based on mathematical concepts, genetic algorithms can be efficiently translated into the language. Several of the steps involved in genetic algorithms, such as mutation, crossover, and fitness evaluation, can be parallelized using SISAL. In this paper I will discuss the implementation and performance of parallel genetic algorithms in SISAL.
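
    The paper expresses these steps in SISAL; as a rough analogue, the C/OpenMP sketch below (our construction, not the paper's) shows why fitness evaluation parallelizes directly: each individual is evaluated independently.

```c
/* Parallel fitness evaluation for a toy genetic algorithm: the loop over
   the population has no cross-iteration dependencies. */
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>

#define POP   256
#define GENES 32

/* toy fitness: count of 1-bits (the "OneMax" benchmark) */
static int fitness(const unsigned char *g) {
    int f = 0;
    for (int i = 0; i < GENES; i++) f += g[i];
    return f;
}

int main(void) {
    unsigned char pop[POP][GENES];
    int fit[POP];
    for (int p = 0; p < POP; p++)          /* random initial population */
        for (int i = 0; i < GENES; i++)
            pop[p][i] = rand() & 1;

    /* each individual is independent: evaluate the population in parallel */
    #pragma omp parallel for
    for (int p = 0; p < POP; p++)
        fit[p] = fitness(pop[p]);

    int best = 0;                          /* sequential reduction */
    for (int p = 1; p < POP; p++)
        if (fit[p] > fit[best]) best = p;
    printf("best fitness this generation: %d/%d\n", fit[best], GENES);
    return 0;
}
```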

  9. Heterogeneous parallel programming capability. Final report

    SciTech Connect

    Flower, J.W.; Kolawa, A.

    1990-11-30

    In creating a heterogeneous parallel processing capability we are really trying to approach three basic problems with current systems: (1) Supercomputer and parallel computer hardware architectures vary widely but need to support one or two fairly standard programming languages and programming models. A particularly important issue concerns the short life cycle of individual hardware designs; (2) Many algorithms require capabilities beyond the reach of single supercomputers but could be approached by several machines working together; and (3) Performing a given task requires integration of a system that may contain many components in addition to the super or parallel computer itself. Peripherals from many different manufacturers must be incorporated.

  10. The PISCES 2 parallel programming environment

    NASA Technical Reports Server (NTRS)

    Pratt, Terrence W.

    1987-01-01

    PISCES 2 is a programming environment for scientific and engineering computations on MIMD parallel computers. It is currently implemented on a flexible FLEX/32 at NASA Langley, a 20 processor machine with both shared and local memories. The environment provides an extended Fortran for applications programming, a configuration environment for setting up a run on the parallel machine, and a run-time environment for monitoring and controlling program execution. This paper describes the overall design of the system and its implementation on the FLEX/32. Emphasis is placed on several novel aspects of the design: the use of a carefully defined virtual machine, programmer control of the mapping of virtual machine to actual hardware, forces for medium-granularity parallelism, and windows for parallel distribution of data. Some preliminary measurements of storage use are included.

  11. Parallel programming with PCN. Revision 2

    SciTech Connect

    Foster, I.; Tuecke, S.

    1993-01-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A). This version of this document describes PCN version 2.0, a major revision of the PCN programming system. It supersedes earlier versions of this report.

  12. Communication Graph Generator for Parallel Programs

    SciTech Connect

    2014-04-08

    Graphator is a collection of relatively simple sequential programs that generate communication graphs/matrices for commonly occurring patterns in parallel programs. Currently, there is support for five communication patterns: two-dimensional 4-point stencil, four-dimensional 8-point stencil, all-to-alls over sub-communicators, random near-neighbor communication, and near-neighbor communication.
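
    As an illustration of the first pattern, the C sketch below generates the 0/1 communication matrix of a two-dimensional 4-point stencil on a process grid (the layout and names are ours, not Graphator's output format).

```c
/* Build the communication matrix for a 2-D 4-point stencil on a
   PX-by-PY grid of processes: comm[i][j] = 1 iff rank i sends to rank j. */
#include <stdio.h>

#define PX 4
#define PY 4
#define NP (PX * PY)

int main(void) {
    int comm[NP][NP] = {{0}};
    for (int r = 0; r < NP; r++) {
        int x = r % PX, y = r / PX;            /* rank -> grid coordinates */
        if (x > 0)      comm[r][r - 1]  = 1;   /* west  neighbor */
        if (x < PX - 1) comm[r][r + 1]  = 1;   /* east  neighbor */
        if (y > 0)      comm[r][r - PX] = 1;   /* south neighbor */
        if (y < PY - 1) comm[r][r + PX] = 1;   /* north neighbor */
    }
    for (int i = 0; i < NP; i++) {             /* print the 0/1 matrix */
        for (int j = 0; j < NP; j++) printf("%d", comm[i][j]);
        printf("\n");
    }
    return 0;
}
```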

  13. Graphics-Based Parallel Programming Tools

    DTIC Science & Technology

    1991-09-01

    ...a more general context by implementing perspective views within the Voyeur system [12]. Voyeur is a more conventional tool for displaying application... Nandakumar Varadaraju, "Interfacing Belvedere with Voyeur," Master's Thesis, COINS Department, University of Massachusetts (June 1991). ... David Socha, Mary L. Bailey, and David Notkin, "Voyeur: Graphical Views of Parallel Programs," SIGPLAN Workshop on Parallel and Distributed Debugging, pp. 206-215 (1988).

  14. Voyeur: Graphical Views of Parallel Programs

    DTIC Science & Technology

    1988-04-01

    Keywords: visualization, parallel debugging, monitoring. ... Voyeur: Graphical Views of Parallel Programs. David Socha, Mary Bailey, and David Notkin, Department of Computer Science FR-35, University of Washington, Seattle, Washington 98195, TR 88-04-03, April 1988. Voyeur is a prototype system that facilitates the construction of application-specific...

  15. Representing Control in Parallel Applicative Programming

    DTIC Science & Technology

    1994-09-01

    Ease of programming should not come at the expense of expressiveness. Therefore we design a parallel applicative language Pscheme such that programmers... Pscheme in this chapter through our design motivation and short program examples. Formal semantics will be given later in Chapter 4. 2.1 Main Features... Principles of Programming Languages, 1988. [16] M. Felleisen, "Modeling continuations without continuations," in Annual ACM Symposium on Principles of...

  16. Simulating Billion-Task Parallel Programs

    SciTech Connect

    Perumalla, Kalyan S; Park, Alfred J

    2014-01-01

    In simulating large parallel systems, bottom-up approaches exercise detailed hardware models with effects from simplified software models or traces, whereas top-down approaches evaluate the timing and functionality of detailed software models over coarse hardware models. Here, we focus on the top-down approach and significantly advance the scale of the simulated parallel programs. Via the direct execution technique combined with parallel discrete event simulation, we stretch the limits of the top-down approach by simulating message passing interface (MPI) programs with millions of tasks. Using a timing-validated benchmark application, a proof-of-concept scaling level is achieved to over 0.22 billion virtual MPI processes on 216,000 cores of a Cray XT5 supercomputer, representing one of the largest direct execution simulations to date, combined with a multiplexing ratio of 1024 simulated tasks per real task.

  17. Parallel Volunteer Learning during Youth Programs

    ERIC Educational Resources Information Center

    Lesmeister, Marilyn K.; Green, Jeremy; Derby, Amy; Bothum, Candi

    2012-01-01

    Lack of time is a hindrance for volunteers to participate in educational opportunities, yet volunteer success in an organization is tied to the orientation and education they receive. Meeting diverse educational needs of volunteers can be a challenge for program managers. Scheduling a Volunteer Learning Track for chaperones that is parallel to a…

  18. Multiprogramming and the performance of parallel programs

    NASA Technical Reports Server (NTRS)

    Benten, Muhammad S.; Jordan, Harry F.

    1989-01-01

    A programming methodology is introduced that utilizes computational synchronization and avoids tight control flow synchronization in parallel programs. In this methodology, each phase of the computation is assigned a status that can be ready, blocked, or completed, and tasks in each computational phase are self-scheduled to ensure computational progress by the available executing processes. Results indicate that this methodology avoids the catastrophic performance losses resulting from the swapping of processes in multiprogrammed multiprocessors.
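
    A minimal C sketch of the self-scheduling idea (the atomic-counter realization and all names are our illustration, not the paper's exact mechanism): tasks of a phase are claimed from a shared counter, so whichever processes the multiprogrammed OS happens to run still make progress.

```c
/* Self-scheduled phase: workers claim tasks from a shared atomic counter,
   so computational progress does not depend on any particular process
   being scheduled by the operating system. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTASKS   64
#define NWORKERS 4

static atomic_int next_task;                        /* shared work pointer */
static int result[NTASKS];

static void do_task(int t) { result[t] = t * t; }   /* one unit of work */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);    /* claim the next task */
        if (t >= NTASKS) break;                     /* phase exhausted */
        do_task(t);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);     /* the phase is "completed" here */
    printf("phase done: result[%d] = %d\n", NTASKS - 1, result[NTASKS - 1]);
    return 0;
}
```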

  1. Program For Parallel Discrete-Event Simulation

    NASA Technical Reports Server (NTRS)

    Beckman, Brian C.; Blume, Leo R.; Geiselman, John S.; Presley, Matthew T.; Wedel, John J., Jr.; Bellenot, Steven F.; Diloreto, Michael; Hontalas, Philip J.; Reiher, Peter L.; Weiland, Frederick P.

    1991-01-01

    User does not have to add any special logic to aid in synchronization. Time Warp Operating System (TWOS) computer program is special-purpose operating system designed to support parallel discrete-event simulation. Complete implementation of Time Warp mechanism. Supports only simulations and other computations designed for virtual time. Time Warp Simulator (TWSIM) subdirectory contains sequential simulation engine interface-compatible with TWOS. TWOS and TWSIM written in, and support simulations in, C programming language.
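
    TWOS itself is a complete Time Warp implementation; the toy C sketch below (all names invented here) only illustrates the core rollback idea: checkpoint state after each event and restore an earlier checkpoint when an event arrives in the virtual past. Anti-messages and re-execution of undone events are omitted.

```c
/* Toy illustration of Time Warp rollback: process events optimistically
   in virtual time; on a "straggler" (timestamp below local virtual time),
   restore the last checkpoint preceding it. */
#include <stdio.h>

#define MAXLOG 128

typedef struct { double vtime; int state; } checkpoint;

static checkpoint log_[MAXLOG];
static int nlog  = 0;
static int state = 0;           /* the simulation object's state  */
static double lvt = 0.0;        /* local virtual time             */

static void process(double t, int delta) {
    state += delta;             /* event handler                  */
    lvt = t;
    log_[nlog].vtime = t;       /* checkpoint after each event    */
    log_[nlog].state = state;
    nlog++;
}

static void receive(double t, int delta) {
    if (t < lvt) {              /* straggler: roll back past time t */
        while (nlog > 0 && log_[nlog - 1].vtime >= t) nlog--;
        state = nlog ? log_[nlog - 1].state : 0;
        lvt   = nlog ? log_[nlog - 1].vtime : 0.0;
        /* a real Time Warp would also re-enqueue the undone events and
           send anti-messages; here we only restore state */
    }
    process(t, delta);
}

int main(void) {
    receive(1.0, +5);
    receive(3.0, +2);
    receive(2.0, +1);           /* straggler: rolls back past vtime 3.0 */
    printf("lvt=%.1f state=%d\n", lvt, state);
    return 0;
}
```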

  2. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)

    2001-01-01

    We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.

  3. Relative Debugging of Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)

    2002-01-01

    We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular, the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.

  4. Concurrency-based approaches to parallel programming

    SciTech Connect

    Kale, L.V.; Chrisochoides, N.; Kohl, J.

    1995-07-17

    The inevitable transition to parallel programming can be facilitated by appropriate tools, including languages and libraries. After describing the needs of applications developers, this paper presents three specific approaches aimed at development of efficient and reusable parallel software for irregular and dynamic-structured problems. A salient feature of all three approaches is their exploitation of concurrency within a processor. Benefits of individual approaches such as these can be leveraged by an interoperability environment which permits modules written using different approaches to co-exist in single applications.

  5. Concurrency-based approaches to parallel programming

    NASA Technical Reports Server (NTRS)

    Kale, L.V.; Chrisochoides, N.; Kohl, J.; Yelick, K.

    1995-01-01

    The inevitable transition to parallel programming can be facilitated by appropriate tools, including languages and libraries. After describing the needs of applications developers, this paper presents three specific approaches aimed at development of efficient and reusable parallel software for irregular and dynamic-structured problems. A salient feature of all three approaches is their exploitation of concurrency within a processor. Benefits of individual approaches such as these can be leveraged by an interoperability environment which permits modules written using different approaches to co-exist in single applications.

  6. Mapping between parallel processor structures and programs

    NASA Technical Reports Server (NTRS)

    Ngai, Tin-Fook; Yan, Jerry C.; Mak, Victor W. K.; Flynn, Michael J.; Lundstrom, Stephen F.

    1987-01-01

    This paper reports some ongoing research efforts at Stanford in allocation of parallel processing resources. Both processor structures and program structures have their own characteristics. Resource allocation binds the two structures during program execution. The mapping problem determines what processor structure and program structure may be combined to obtain maximum speedup. Three approaches to this mapping problem are considered. Two important factors, granularity and interaction delay, are also considered. A new hierarchical approach to structure definition is outlined. Effective and efficient tools are necessary for the study of the mapping problem. A fast turn-around simulation environment developed for investigating partition strategies for distributed computations and a computationally efficient method to predict performance of parallel processor structures are described.

  7. Using Abstraction in Explicitly Parallel Programs.

    DTIC Science & Technology

    1991-07-01

    ...used as a design language, and our presumption is that programmers can be trained to recognize good designs at the specification level, and then... Definition: two histories H1 and H2 are equivalent if for every thread T, H1|T = H2|T. When a history contains pending invocations, a complication... instead, parallelism is added at the program level by overlapping the Pi's, then the periods of low utilization in one P may be masked, effectively...

  8. Scheduling parallel programs in distributed systems

    SciTech Connect

    Rommel, C.G.

    1988-01-01

    Scheduling parallel programs under the processor-sharing discipline for uniprocessors, multiprocessors, and distributed systems was studied. Two classes of parallel programs are considered: those without any IPC (called Fork-Join jobs) and those with asynchronous and uniform IPC (called clusters). The study is divided into two parts: (1) develops analytical solutions for Fork-Join jobs on uniprocessors and multiprocessors; and (2) develops and evaluates via simulation Fork-Join jobs and clusters on distributed systems. The types of site scheduling studied are TS-PS, where tasks of a job are scheduled independently at processor-sharing servers; JS-PS, in which tasks of a job are scheduled as a single entity at processor-sharing servers; and FCFS, where tasks of a job are scheduled independently by order of arrival. For Poisson job arrivals and exponentially distributed task service times, analytical solutions and computationally efficient bounds were found for Fork-Join TS-PS and JS-PS job response times. An algorithm was developed to schedule parallel programs in distributed systems. Over a wide range of parameters the algorithm was found to be superior to both no-load balancing (NLB) and shortest-queue-first scheduling (SQF).
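
    As background for the processor-sharing models above (a standard queueing fact, not the thesis's fork-join result): a single M/M/1 processor-sharing server with Poisson arrivals at rate lambda and exponential service at rate mu has the same mean response time as M/M/1 FCFS.

```latex
% Mean response time of an M/M/1 processor-sharing (PS) server,
% the building block for the TS-PS and JS-PS disciplines above:
\[
  E[T] \;=\; \frac{1}{\mu - \lambda}, \qquad \lambda < \mu .
\]
% Fork-join jobs add a synchronization delay on top of this, since a
% job completes only when its slowest task finishes.
```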

  9. Array distribution in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

    1994-01-01

    We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.
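
    As a concrete instance of the distribution arithmetic such an algorithm reasons about, here is the owner/local-index computation for a simple block distribution, sketched in C (the helper names are ours, not the paper's):

```c
/* Index arithmetic for a block distribution of N array elements over P
   processors: contiguous blocks of ceil(N/P) elements per processor. */
#include <stdio.h>

/* processor that owns global index i */
static int owner(int i, int N, int P) {
    int block = (N + P - 1) / P;        /* ceil(N/P) elements per block */
    return i / block;
}

/* position of global index i within its owner's local array */
static int local_index(int i, int N, int P) {
    int block = (N + P - 1) / P;
    return i % block;
}

int main(void) {
    int N = 10, P = 4;                  /* blocks of ceil(10/4) = 3 */
    for (int i = 0; i < N; i++)
        printf("i=%d -> owner %d, local %d\n",
               i, owner(i, N, P), local_index(i, N, P));
    return 0;
}
```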

  10. XJava: Exploiting Parallelism with Object-Oriented Stream Programming

    NASA Astrophysics Data System (ADS)

    Otto, Frank; Pankratius, Victor; Tichy, Walter F.

    This paper presents the XJava compiler for parallel programs. It exploits parallelism based on an object-oriented stream programming paradigm. XJava extends Java with new parallel constructs that do not expose programmers to low-level details of parallel programming on shared memory machines. Tasks define composable parallel activities, and new operators allow an easier expression of parallel patterns, such as pipelines, divide and conquer, or master/worker. We also present an automatic run-time mechanism that extends our previous work to automatically map tasks and parallel statements to threads.

  11. A Tutorial on Parallel and Concurrent Programming in Haskell

    NASA Astrophysics Data System (ADS)

    Peyton Jones, Simon; Singh, Satnam

    This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs, with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs, which lets programmers use rich data types in data-parallel programs that are automatically transformed into flat data-parallel versions for efficient execution on multi-core processors.

  12. Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress has been made in hardware and software technologies, the performance of parallel programs written with compiler directives has improved substantially. The introduction of OpenMP directives, the industry standard for shared-memory programming, has minimized the issue of portability. Due to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based OpenMP parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS parallel benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.
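
    For illustration, here is the kind of directive-based code such a tool emits -- an ordinary C loop nest annotated with an OpenMP parallel-for directive (this particular loop is hand-written here, not taken from the NAS benchmarks):

```c
/* Directive-based parallelization: the loop nest is ordinary serial C,
   and the pragma tells the compiler to split the i-iterations over threads. */
#include <stdio.h>

#define N 512

static double a[N][N], b[N][N];

int main(void) {
    /* the i-iterations are independent, so a parallel-for directive is
       legal; tmp is declared inside the loop body and is therefore
       automatically private to each thread */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double tmp = 0.25 * (double)(i + j);
            a[i][j] = b[i][j] + tmp;   /* no loop-carried dependence */
        }
    }
    printf("a[%d][%d] = %f\n", N - 1, N - 1, a[N - 1][N - 1]);
    return 0;
}
```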

  13. Programming in Manticore, a Heterogenous Parallel Functional Language

    NASA Astrophysics Data System (ADS)

    Fluet, Matthew; Bergstrom, Lars; Ford, Nic; Rainey, Mike; Reppy, John; Shaw, Adam; Xiao, Yingqi

    The Manticore project is an effort to design and implement a new functional language for parallel programming. Unlike many earlier parallel languages, Manticore is a heterogeneous language that supports parallelism at multiple levels. Specifically, the Manticore language combines Concurrent ML-style explicit concurrency with fine-grain, implicitly threaded, parallel constructs. These lectures will introduce the Manticore language and explore a variety of programs written to take advantage of heterogeneous parallelism.

  14. Parallel Programming Strategies for Irregular Adaptive Applications

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2001-01-01

    Achieving scalable performance for dynamic irregular applications is eminently challenging. Traditional message-passing approaches have been making steady progress towards this goal; however, they suffer from complex implementation requirements. The use of a global address space greatly simplifies the programming task, but can degrade the performance for such computations. In this work, we examine two typical irregular adaptive applications, Dynamic Remeshing and N-Body, under competing programming methodologies and across various parallel architectures. The Dynamic Remeshing application simulates flow over an airfoil, and refines localized regions of the underlying unstructured mesh. The N-Body experiment models two neighboring Plummer galaxies that are about to undergo a merger. Both problems demonstrate dramatic changes in processor workloads and interprocessor communication with time; thus, dynamic load balancing is a required component.

  15. Flexible Language Constructs for Large Parallel Programs

    DOE PAGES

    Rosing, Matt; Schnabel, Robert

    1994-01-01

    The goal of the research described in this article is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (multiple instruction multiple data [MIMD]) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include single instruction multiple data (SIMD), single program multiple data (SPMD), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. In this article, we give an overview of a new language that combines many of these programming models in a clean manner. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. In this article, we give an overview of the language and discuss some of the critical implementation details.

  16. Flexible language constructs for large parallel programs

    NASA Technical Reports Server (NTRS)

    Rosing, Matthew; Schnabel, Robert

    1993-01-01

    The goal of the research described is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (MIMD) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. An overview of a new language that combines many of these programming models in a clean manner is given. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. An overview of the language and a discussion of some of the critical implementation details are given.

  17. Grundy: Parallel Processor Architecture Makes Programming Easy

    NASA Astrophysics Data System (ADS)

    Meier, Robert J.

    1985-12-01

    Grundy, an architecture for parallel processing, facilitates the use of high-level languages. In Grundy, several thousand simple processors are dispersed throughout the address space, and the concept of machine state is replaced by an invocation frame, a data structure of local variables, program counter, and pointers to superprocesses (parents), subprocesses (children), and concurrent processes (siblings). Each instruction execution consists of five phases: an instruction is fetched, the instruction is decoded, the sources are fetched, the operation is performed, and the destination is written. This breakdown of operations is easily pipelined. The instruction format of Grundy is completely orthogonal, so Grundy machine code consists of a set of register transfer control bits. The process state pointers are used to collect unused resources such as processors and memory. Joseph Mahon [1] found that as the degree of physical parallelism increases, throughput, including overhead, increases even if extra overhead is needed to split logical processes. Just as stack pointers, accumulators, and index registers facilitate the use of high-level languages on conventional computers, pointers to parents, children, and siblings simplify the use of a run-time operating system. The ability to ignore the physical structure of a large number of simple processors supports the use of structured programming. A very simple processor cell allows the replication of approximately 16 32-bit processors on a single Very Large Scale Integration chip (2M lambda [2]). A bootstrapper and Input/Output channels can be hardwired (using ROM cells and pseudo-processor cells) into a 100-chip computer that is expected to have over 500 processors, 500K memory, and a network supporting up to 64 concurrent messages between 1000 nodes. These sizes are merely typical and not limits.
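
    The invocation frame described above might be rendered in C roughly as follows; the field names and sizes are illustrative assumptions based only on the abstract, not the actual Grundy design.

      /* Sketch of a Grundy-style invocation frame, as the abstract describes
         it: local state plus links to parent, child, and sibling processes
         replace a conventional machine state. Names and sizes are invented. */
      #include <stdint.h>

      #define MAX_LOCALS 16

      struct invocation_frame {
          uint32_t pc;                       /* program counter of this process */
          uint32_t locals[MAX_LOCALS];       /* local variables                 */
          struct invocation_frame *parent;   /* superprocess                    */
          struct invocation_frame *children; /* first subprocess                */
          struct invocation_frame *sibling;  /* next concurrent process         */
      };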

  18. Software reuse in parallel programming environments

    SciTech Connect

    Lee, Taejae.

    1989-01-01

    To date, reuse of software has not had its anticipated effect on improvements in software productivity. This is because the concepts behind reusability are not fully understood and because there has been relatively little experimentation with reusability systems. In this research the author attacks these problems in three ways: (1) an investigation of the conceptual foundations of reuse for a parallel programming environment based on the Unified Computation Graph Model designed by Dr. James C. Browne at the University of Texas, Austin; (2) a realization of these concepts in a software base management system, ROPE, to support reuse in such an environment; (3) an experimental evaluation of the effectiveness of ROPE. The research addresses each of the fundamental steps of finding, understanding, modifying, and composing reusable components: (1) the problem of finding components is addressed by a new classification method, called the structured relational classification method; (2) for understanding components, the author has introduced design analysis methods which basically flow from the UCGM model itself; (3) modifying components is addressed in several ways. First through a suitable definition of generic designs and secondly through techniques for composing and decomposing graphs; (4) composition of components is discussed in detail and a framework is laid for a calculus of composition of components. The reusability system ROPE was built, tested and used by a variety of people. Each of the concepts discussed above was realized to some degree in the final system though the theory outstripped the implementation in several areas. This was a very substantial programming project. A fairly extensive evaluation of ROPE was done. The initial set of experiments has clearly established the effectiveness of CODE and ROPE in promoting component reuse in programs of modest size and complexity and in delivery of nearly error-free programs with relatively little effort.

  19. Parallel solution of sparse one-dimensional dynamic programming problems

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1989-01-01

    Parallel computation offers the potential for quickly solving large computational problems. However, it is often a non-trivial task to effectively use parallel computers. Solution methods must sometimes be reformulated to exploit parallelism; the reformulations are often more complex than their slower serial counterparts. We illustrate these points by studying the parallelization of sparse one-dimensional dynamic programming problems, those which do not obviously admit substantial parallelization. We propose a new method for parallelizing such problems, develop analytic models which help us to identify problems which parallelize well, and compare the performance of our algorithm with existing algorithms on a multiprocessor.
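
    The recurrences in question have the general form f(j) = min over i < j of (f(i) + c(i,j)). The outer loop over j is inherently sequential, but the min-reduction at each stage parallelizes; the C/OpenMP sketch below shows that baseline stage-parallel approach (a generic illustration with an invented cost function and an OpenMP 3.1 min reduction, not the authors' algorithm, which targets the sparse case).

      #include <stdio.h>
      #include <float.h>

      #define N 2048

      /* Arbitrary stand-in cost; a real problem would supply its own c(i,j). */
      static double cost(int i, int j) { return (j - i) * 0.5 + (i % 7); }

      int main(void) {
          static double f[N];
          f[0] = 0.0;
          /* The outer loop is sequential: f[j] depends on all earlier entries.
             The inner min-reduction over i is the parallelizable part. */
          for (int j = 1; j < N; j++) {
              double best = DBL_MAX;
              #pragma omp parallel for reduction(min:best)
              for (int i = 0; i < j; i++) {
                  double v = f[i] + cost(i, j);
                  if (v < best) best = v;
              }
              f[j] = best;
          }
          printf("f[N-1] = %f\n", f[N - 1]);
          return 0;
      }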

  20. Parallel phase model : a programming model for high-end parallel machines with manycores.

    SciTech Connect

    Wu, Junfeng; Wen, Zhaofang; Heroux, Michael Allen; Brightwell, Ronald Brian

    2009-04-01

    This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementation of parallel algorithms to exploit both the parallelism of the many cores and the parallelism at the cluster level. The programming abstraction will be suitable for expressing both fine-grained and coarse-grained parallelism. It includes a few high-level parallel programming language constructs that can be added as an extension to an existing (sequential or parallel) programming language such as C; and the implementation of PPM also includes a light-weight runtime library that runs on top of an existing network communication software layer (e.g. MPI). Design philosophy of PPM and details of the programming abstraction are also presented. Several unstructured applications that inherently require high-volume random fine-grained data accesses have been implemented in PPM with very promising results.

  1. An interactive parallel programming environment applied in atmospheric science

    SciTech Connect

    Laszewski, G. von

    1996-12-31

    This article introduces an interactive parallel programming environment (IPPE) that simplifies the generation and execution of parallel programs. One of the tasks of the environment is to generate message-passing parallel programs for homogeneous and heterogeneous computing platforms. The parallel programs are represented by using visual objects. This is accomplished with the help of a graphical programming editor that is implemented in Java and enables portability to a wide variety of computer platforms. In contrast to other graphical programming systems, reusable parts of the programs can be stored in a program library to support rapid prototyping. In addition, runtime performance data on different computing platforms is collected in a database. A selection process determines dynamically the software and the hardware platform to be used to solve the problem in minimal wall-clock time. The environment is currently being tested on a Grand Challenge problem, the NASA four-dimensional data assimilation system.

  2. An interactive parallel programming environment applied in atmospheric science

    NASA Technical Reports Server (NTRS)

    vonLaszewski, G.

    1996-01-01

    This article introduces an interactive parallel programming environment (IPPE) that simplifies the generation and execution of parallel programs. One of the tasks of the environment is to generate message-passing parallel programs for homogeneous and heterogeneous computing platforms. The parallel programs are represented by using visual objects. This is accomplished with the help of a graphical programming editor that is implemented in Java and enables portability to a wide variety of computer platforms. In contrast to other graphical programming systems, reusable parts of the programs can be stored in a program library to support rapid prototyping. In addition, runtime performance data on different computing platforms is collected in a database. A selection process determines dynamically the software and the hardware platform to be used to solve the problem in minimal wall-clock time. The environment is currently being tested on a Grand Challenge problem, the NASA four-dimensional data assimilation system.

  4. Programming parallel architectures - The BLAZE family of languages

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush

    1989-01-01

    This paper gives an overview of the various approaches to programming multiprocessor architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive, since they remove much of the burden of exploiting parallel architectures from the user. This paper also describes recent work in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described.

  5. The BLAZE language: A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, P.; Vanrosendale, J.

    1985-01-01

    A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described, and it is shown how this language would be used in typical scientific programming.

  6. The BLAZE language - A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Van Rosendale, John

    1987-01-01

    A Pascal-like scientific programming language, BLAZE, is described. BLAZE contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of BLAZE are described and it is shown how this language would be used in typical scientific programming.

  8. Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++

    NASA Technical Reports Server (NTRS)

    Bodin, Francois; Priol, Thierry; Mehrotra, Piyush; Gannon, Dennis

    1994-01-01

    Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++ which is based on a concurrent aggregate parallel model.

  9. MPISH : a parallel shell for MPI programs.

    SciTech Connect

    Desai, M.; Lusk, A.; Bradshaw, R.; Lusk, E.

    2006-01-01

    While previous work has shown MPI to provide capabilities for system software, actual adoption has not widely occurred. We discuss process management shortcomings in MPI implementations and their impact on MPI usability for system software and management tasks. We introduce MPISH, a parallel shell designed to address these issues.

  10. Language constructs and runtime systems for compositional parallel programming

    SciTech Connect

    Foster, I.; Kesselman, C.

    1995-03-01

    In task-parallel programs, diverse activities can take place concurrently, and communication and synchronization patterns are complex and not easily predictable. Previous work has identified compositionality as an important design principle for task-parallel programs. In this paper, we discuss alternative approaches to the realization of this principle. We first provide a review and critical analysis of Strand, an early compositional programming language. We examine the strengths of the Strand approach and also its weaknesses, which we attribute primarily to the use of a specialized language. Then, we present an alternative programming language framework that overcomes these weaknesses. This framework uses simple extensions to existing sequential languages (C++ and Fortran) and a common runtime system to provide a basis for the construction of large, task-parallel programs. We also discuss the runtime system techniques required to support these languages on parallel and distributed computer systems.

  11. Programming parallel architectures: The BLAZE family of languages

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush

    1988-01-01

    Programming multiprocessor architectures is a critical research issue. An overview is given of the various approaches to programming these architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive since they remove much of the burden of exploiting parallel architectures from the user. Also described is recent work by the author in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described, as well as the relations of this work to other current language research projects.

  12. Automated Scalability Analysis Tools for Message Passing Parallel Programs

    NASA Technical Reports Server (NTRS)

    Sarukkai, Sekhar R.; Mehra, Pankaj; Tucker, Deanne (Technical Monitor)

    1994-01-01

    In order to develop scalable parallel applications, a number of programming decisions have to be made during the development of the program. Few performance tools exist to help in making these decisions. Traditionally, performance tools have focused on exposing performance bottlenecks in small-scale executions of the program. However, it is common knowledge that programs that perform exceptionally well on small processor configurations, more often than not, perform poorly when executed on larger processor configurations. Hence, new tools that predict the execution characteristics of scaled-up programs are an essential part of an application developer's toolkit. In this paper we discuss important issues that need to be considered in order to build useful scalability analysis tools for parallel programs. We introduce a simple tool that automatically extracts scalability characteristics of a class of deterministic parallel programs. We show, with the help of a number of results on the Intel iPSC/860, that predictions are within reasonable bounds.
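
    The flavor of such predictions can be conveyed by a toy analytic model: an Amdahl-style serial fraction plus a communication term that grows with the processor count. The constants below are invented; a real tool would fit such parameters from instrumented runs.

      #include <stdio.h>

      /* Toy model of scaled-up execution time: serial fraction s, perfectly
         parallel fraction (1 - s), plus a communication overhead that grows
         with the processor count. All constants are invented. */
      static double predicted_time(double t1, double s, double comm, int p) {
          return t1 * s + t1 * (1.0 - s) / p + comm * p;
      }

      int main(void) {
          double t1 = 100.0;   /* measured single-processor time (seconds)   */
          double s  = 0.05;    /* serial fraction estimated from a small run */
          double c  = 0.01;    /* per-processor communication cost (seconds) */
          for (int p = 1; p <= 128; p *= 2) {
              double t = predicted_time(t1, s, c, p);
              printf("p = %3d  predicted time = %7.2f  speedup = %6.2f\n",
                     p, t, t1 / t);
          }
          return 0;
      }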

  13. Architectural Adaptability in Parallel Programming via Control Abstraction

    DTIC Science & Technology

    1991-01-01

    Technical Report 359, January 1991. Parallel programming involves finding the potential parallelism in an application, choosing an...

  14. Combined And-Or Parallel Execution of Logic Programs,

    DTIC Science & Technology

    1988-03-01

  15. Multilist Scheduling. A New Parallel Programming Model.

    DTIC Science & Technology

    1993-07-30

    ...fluid simulation [53]; differential equation solving such as weather prediction [24, 25]; digital circuit simulation such as gate-level simulation [20]...

  16. A Programming Environment for Parallel Vision Algorithms

    DTIC Science & Technology

    1990-04-11

    ...linear parallel speedup. Many applications for the image processing pipeline (including tracking, color histogramming, feature detection, frame-rate...) ...pure logic. For example, a language based on the algebra of real numbers might treat constraints such as "X = Y + Z", "X = Y x Z", and so on as primitives. ...It is, however, time for a more usable version of the language. A front-end processor is therefore being written to parse expressions written in an algebraic...

  17. Programming Probabilistic Structural Analysis for Parallel Processing Computer

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.

    1991-01-01

    The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.

  18. The FORCE: A highly portable parallel programming language

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.
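
    The two-level idea can be sketched as follows: high-level code is written only in terms of machine-independent constructs, while a low-level layer (swapped per target machine) supplies their expansion. This is illustrative C in the spirit of the abstract, not actual FORCE macros; POSIX threads stand in for one possible machine-dependent layer.

      #include <pthread.h>
      #include <stdio.h>

      #define NPROCS 4

      /* --- low-level layer: one possible machine-dependent expansion --- */
      static pthread_barrier_t the_barrier;
      #define BARRIER_INIT(n) pthread_barrier_init(&the_barrier, NULL, (n))
      #define BARRIER()       pthread_barrier_wait(&the_barrier)

      /* --- high-level code written only in terms of the portable macros --- */
      static void *worker(void *arg) {
          long id = (long)arg;
          printf("process %ld: phase 1\n", id);
          BARRIER();                      /* all processes synchronize here */
          printf("process %ld: phase 2\n", id);
          return NULL;
      }

      int main(void) {
          pthread_t t[NPROCS];
          BARRIER_INIT(NPROCS);
          for (long i = 0; i < NPROCS; i++)
              pthread_create(&t[i], NULL, worker, (void *)i);
          for (int i = 0; i < NPROCS; i++)
              pthread_join(t[i], NULL);
          return 0;
      }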

  19. The FORCE - A highly portable parallel programming language

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.

  1. Integrated Task And Data Parallel Programming: Language Design

    NASA Technical Reports Server (NTRS)

    Grimshaw, Andrew S.; West, Emily A.

    1998-01-01

    This research investigates the combination of task and data parallel language constructs within a single programming language. There are a number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments: In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda: Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities: During the fall I collaborated...

  2. Characterizing and Mitigating Work Time Inflation in Task Parallel Programs

    DOE PAGES

    Olivier, Stephen L.; de Supinski, Bronis R.; Schulz, Martin; ...

    2013-01-01

    Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.
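
    For readers unfamiliar with OpenMP tasking, the following standard C sketch shows the programming model being measured; it is not code from the paper. The scheduling and data-placement decisions that cause or mitigate work time inflation happen inside the runtime, not in this source.

      #include <stdio.h>

      /* Classic task-parallel Fibonacci: each recursive call becomes a task
         that the runtime schedules onto idle threads. Scheduling overhead and
         remote data access are exactly where work time inflation shows up. */
      static long fib(int n) {
          if (n < 2) return n;
          long x, y;
          #pragma omp task shared(x)
          x = fib(n - 1);
          #pragma omp task shared(y)
          y = fib(n - 2);
          #pragma omp taskwait          /* wait for both child tasks */
          return x + y;
      }

      int main(void) {
          long r;
          #pragma omp parallel
          #pragma omp single            /* one thread seeds the task tree */
          r = fib(30);
          printf("fib(30) = %ld\n", r);
          return 0;
      }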

  3. NavP: Structured and Multithreaded Distributed Parallel Programming

    NASA Technical Reports Server (NTRS)

    Pan, Lei; Xu, Jingling

    2006-01-01

    This slide presentation reviews some of the issues around distributed parallel programming. It compares and contrasts two methods of programming: Single Program Multiple Data (SPMD) and Navigational Programming (NavP). It then reviews the distributed sequential computing (DSC) method and the methodology of NavP. Case studies are presented. It also reviews the work that is being done to enable the NavP system.

  4. Development of massively parallel quantum chemistry program SMASH

    SciTech Connect

    Ishimura, Kazuya

    2015-12-31

    A massively parallel program for quantum chemistry calculations, SMASH, was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C150H30)2 with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer.
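
    SMASH itself is Fortran, but the hybrid structure it uses (MPI ranks across nodes, OpenMP threads within a node) can be sketched in C; this is a generic example, not SMASH code.

      #include <mpi.h>
      #include <stdio.h>

      #define N 1000000

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* MPI splits the index range across processes (typically nodes)... */
          int lo = rank * N / size, hi = (rank + 1) * N / size;
          double local = 0.0;

          /* ...and OpenMP threads share the work within each process. */
          #pragma omp parallel for reduction(+:local)
          for (int i = lo; i < hi; i++)
              local += 1.0 / (1.0 + (double)i);

          double global;
          MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0) printf("sum = %f\n", global);
          MPI_Finalize();
          return 0;
      }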

  5. Development of massively parallel quantum chemistry program SMASH

    NASA Astrophysics Data System (ADS)

    Ishimura, Kazuya

    2015-12-01

    A massively parallel program for quantum chemistry calculations, SMASH, was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C150H30)2 with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer.

  6. What a Parallel Programming Language Has to Let You Say,

    DTIC Science & Technology

    1984-09-01

    ...the design of such communications protocols. Sometimes the ... represents an object that the programmer wishes to treat ... In a running Scheme program, thousands of different Scheme programs can be executed in parallel at the cost of a moderate constant-factor slow-down. (AD-A147 854, Massachusetts Institute of Technology, Artificial Intelligence Laboratory.)

  7. Web Based Parallel Programming Workshop for Undergraduate Education.

    ERIC Educational Resources Information Center

    Marcus, Robert L.; Robertson, Douglass

    Central State University (Ohio), under a contract with Nichols Research Corporation, has developed a World Wide Web-based workshop on high performance computing entitled "IBM SP2 Parallel Programming Workshop." The research is part of the DoD (Department of Defense) High Performance Computing Modernization Program. The research…

  8. Deadlock and fictitiousness problem in parallel program specifications

    SciTech Connect

    Panfilenko, V.P.

    1995-05-01

    One of the directions of modern programming based on algebraic methods takes its origin in V.M. Glushkov's theory of systems of algorithmic algebras (SAA). The SAA apparatus, with appropriately interpreted operations, is used for program design and allows compact structured representation of program schemas in the form of algebraic formulas. Modified systems of algorithmic algebras (SAA-M) additionally provide tools for describing parallelism.

  9. Exploiting loop level parallelism in nonprocedural dataflow programs

    NASA Technical Reports Server (NTRS)

    Gokhale, Maya B.

    1987-01-01

    This paper discusses how loop-level parallelism is detected in a nonprocedural dataflow program, and how a procedural program with concurrent loops is scheduled. Also discussed is a program restructuring technique which may be applied to recursive equations so that concurrent loops may be generated for a seemingly iterative computation. A compiler which generates C code for the language described below has been implemented. The scheduling component of the compiler and the restructuring transformation are described.

  10. Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.

    2000-01-01

    Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Still more toolkits are available that aid construction of message-passing programs from scratch. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining...
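
    A hypothetical sketch of the side-by-side idea follows: a block-distributed local array coexists with the legacy global index space, and small mapping helpers translate between the two views. The helper names are invented for illustration and are not the Charon API.

      #include <mpi.h>
      #include <stdio.h>

      #define N 1000            /* global extent of the legacy array */

      static int rank, size;

      /* Invented owner-block mapping helpers, standing in for the kind of
         mapping functions the abstract describes. */
      static int local_count(void) { return N / size + (rank < N % size ? 1 : 0); }
      static int global_index(int iloc) {
          int base = rank * (N / size) + (rank < N % size ? rank : N % size);
          return base + iloc;
      }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          double local[N];      /* generous upper bound for the local block */
          /* Each rank fills only the part of the "legacy" array it owns,
             using the mapping to keep the original global indexing visible. */
          for (int i = 0; i < local_count(); i++)
              local[i] = 2.0 * global_index(i);

          printf("rank %d owns %d elements starting at global index %d\n",
                 rank, local_count(), global_index(0));
          MPI_Finalize();
          return 0;
      }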

  11. The parallel programming of voluntary and reflexive saccades.

    PubMed

    Walker, Robin; McSorley, Eugene

    2006-06-01

    A novel two-step paradigm was used to investigate the parallel programming of consecutive, stimulus-elicited ('reflexive') and endogenous ('voluntary') saccades. The mean latency of voluntary saccades, made following the first reflexive saccades in two-step conditions, was significantly reduced compared to that of voluntary saccades made in the single-step control trials. The latency of the first reflexive saccades was modulated by the requirement to make a second saccade: first saccade latency increased when a second voluntary saccade was required in the opposite direction to the first saccade, and decreased when a second saccade was required in the same direction as the first reflexive saccade. A second experiment confirmed the basic effect and also showed that a second reflexive saccade may be programmed in parallel with a first voluntary saccade. The results support the view that voluntary and reflexive saccades can be programmed in parallel on a common motor map.

  12. Development of LGA & LBE 2D Parallel Programs

    NASA Astrophysics Data System (ADS)

    Ujita, Hiroshi; Nagata, Satoru; Akiyama, Minoru; Naitoh, Masanori; Ohashi, Hirotada

    A two-dimensional lattice-gas automata (LGA) program was developed for analysis of single- and two-phase flow behaviors, to support the development of integrated software modules for nuclear power plant mechanistic simulations. The program has single-color (including the FHP I, II, and III models), two-color (immiscible lattice gas), and two-velocity methods, including a gravity effect model. Parameter surveys have been performed for the Karman vortex street, two-phase separation for understanding flow regimes, and natural circulation flow for demonstrating passive reactor safety due to the chimney-structure vessel. In addition, two-dimensional lattice-Boltzmann equation (LBE) programs were also developed. For analyzing single-phase flow behavior, a lattice-Boltzmann-BGK program was developed, which has multi-block treatments. A parallelized finite-difference lattice-Boltzmann equation program was introduced to analyze boiling two-phase flow behaviors. Parameter surveys have been performed for backward-facing flow, the Karman vortex street, bent piping flow with/without obstacles for piping system applications, flow in porous media for demonstrating porous debris coolability, Couette flow, and spinodal decomposition to understand basic phase separation mechanisms. Parallelization was completed by using a domain decomposition method for all of the programs. An increase in calculation speed of at least 25 times, by parallel processing on 32 processors, demonstrated high parallelization efficiency.
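
    A speedup of 25 on 32 processors corresponds to a parallel efficiency of roughly 78 percent. In implementation terms, the domain decomposition used here typically reduces to exchanging ghost (halo) cells between neighboring subdomains; the following minimal 1-D MPI sketch in C is a generic illustration, not the authors' code.

      #include <mpi.h>
      #include <stdio.h>

      #define NLOC 64   /* interior cells per rank (illustrative) */

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* u[0] and u[NLOC+1] are ghost cells holding neighbor boundary data. */
          double u[NLOC + 2];
          for (int i = 1; i <= NLOC; i++) u[i] = rank;

          int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
          int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

          /* Exchange boundary cells with both neighbors; MPI_PROC_NULL makes
             the physical-boundary ranks skip the missing side automatically. */
          MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                       &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                       &u[0],        1, MPI_DOUBLE, left,  1,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          if (rank == 0) printf("halo exchange complete on %d ranks\n", size);
          MPI_Finalize();
          return 0;
      }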

  13. Monitors, messages, and clusters : the p4 parallel programming system.

    SciTech Connect

    Butler, R. M.; Lusk, E. T.; Mathematics and Computer Science; Univ. of North Florida

    1994-01-01

    p4 is a portable library of C and Fortran subroutines for programming parallel computers. It is the current version of a system that has been in use since 1984. It includes features for explicit parallel programming of shared-memory machines, distributed-memory machines (including heterogeneous networks of workstations), and clusters, by which we mean shared-memory multiprocessors communicating via message passing. We discuss here the design goals, history, and system architecture of p4 and describe briefly a diverse collection of applications that have demonstrated the utility of p4.

  14. Parallelization of Program to Optimize Simulated Trajectories (POST3D)

    NASA Technical Reports Server (NTRS)

    Hammond, Dana P.; Korte, John J. (Technical Monitor)

    2001-01-01

    This paper describes the parallelization of the Program to Optimize Simulated Trajectories (POST3D). POST3D uses a gradient-based optimization algorithm that reaches an optimum design point by moving from one design point to the next. The gradient calculations required to complete the optimization process dominate the computational time and have been parallelized using a Single Program Multiple Data (SPMD) approach on a distributed-memory NUMA (non-uniform memory access) architecture. The Origin2000 was used for the tests presented.
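
    The reason the gradient phase parallelizes so cleanly is that each finite-difference perturbation of a design variable is an independent function evaluation. The C/MPI sketch below shows the generic SPMD pattern with an invented objective function; it is not POST3D code.

      #include <mpi.h>
      #include <stdio.h>

      #define NDV 16   /* number of design variables (illustrative) */

      /* Stand-in objective; the real code would run a trajectory simulation. */
      static double objective(const double *x) {
          double s = 0.0;
          for (int i = 0; i < NDV; i++) s += (x[i] - 1.0) * (x[i] - 1.0);
          return s;
      }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          double x[NDV] = {0}, g_local[NDV] = {0}, g[NDV];
          double f0 = objective(x), h = 1e-6;

          /* Each rank evaluates the finite-difference components it owns; the
             evaluations are independent, which is what makes the gradient
             phase the natural SPMD target. */
          for (int i = rank; i < NDV; i += size) {
              double xs = x[i];
              x[i] = xs + h;
              g_local[i] = (objective(x) - f0) / h;
              x[i] = xs;
          }
          /* Disjoint components sum into the full gradient on every rank. */
          MPI_Allreduce(g_local, g, NDV, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          if (rank == 0) printf("g[0] = %f\n", g[0]);
          MPI_Finalize();
          return 0;
      }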

  15. Execution models for mapping programs onto distributed memory parallel computers

    NASA Technical Reports Server (NTRS)

    Sussman, Alan

    1992-01-01

    The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.

  16. Center for Programming Models for Scalable Parallel Computing

    SciTech Connect

    John Mellor-Crummey

    2008-02-29

    Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implementation of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf.
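
    Item (5), synchronization strength reduction, can be illustrated outside CAF: when a thread depends on one producer rather than on all threads, a single point-to-point flag replaces a barrier. The C11-atomics sketch below is a generic illustration of the idea, not the project's transformation.

      #include <stdatomic.h>
      #include <pthread.h>
      #include <stdio.h>

      static double shared_value;
      static atomic_int ready = 0;

      static void *producer(void *arg) {
          (void)arg;
          shared_value = 42.0;                 /* produce the data */
          atomic_store_explicit(&ready, 1, memory_order_release);
          return NULL;
      }

      static void *consumer(void *arg) {
          (void)arg;
          /* Wait only on the one producer, instead of a barrier over all
             threads; release/acquire ordering makes the data visible. */
          while (!atomic_load_explicit(&ready, memory_order_acquire))
              ;
          printf("consumer saw %f\n", shared_value);
          return NULL;
      }

      int main(void) {
          pthread_t p, c;
          pthread_create(&c, NULL, consumer, NULL);
          pthread_create(&p, NULL, producer, NULL);
          pthread_join(p, NULL);
          pthread_join(c, NULL);
          return 0;
      }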

  17. Modelling parallel programs and multiprocessor architectures with AXE

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Fineman, Charles E.

    1991-01-01

    AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior are described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.

  18. Efficient Thread Labeling for Monitoring Programs with Nested Parallelism

    NASA Astrophysics Data System (ADS)

    Ha, Ok-Kyoon; Kim, Sun-Sook; Jun, Yong-Kee

    It is difficult and cumbersome to detect data races that occur in an execution of parallel programs. Any on-the-fly race detection technique using Lamport's happened-before relation needs a thread labeling scheme for generating unique identifiers which maintain logical concurrency information for the parallel threads. NR labeling is an efficient thread labeling scheme for the fork-join program model with nested parallelism, because its efficiency depends only on the nesting depth of each fork and join operation. This paper presents an improved NR labeling, called e-NR labeling, in which every thread generates its label by inheriting the pointer to its ancestor list from the parent thread or by updating the pointer in a constant amount of time and space. This labeling is more efficient than NR labeling, because its efficiency does not depend on the nesting depth of each fork and join operation. Some experiments were performed with OpenMP programs having nesting depths of three or four and maximum parallelism varying from 10,000 to 1,000,000. The results show that e-NR is 5 times faster than NR labeling and 4.3 times faster than OS labeling in the average time for creating and maintaining the thread labels. In average space required for labeling, it is 3.5 times smaller than NR labeling and 3 times smaller than OS labeling.
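
    As a rough illustration only, a label of the kind the abstract describes might carry the following fields in C; the actual e-NR rules for creating and comparing labels are given in the paper.

      /* Sketch of the label data the abstract describes: nesting information
         plus a pointer to an ancestor list inherited from the parent thread,
         so fork and join can update labels in constant time and space. The
         concurrency-comparison rules themselves are not reproduced here. */
      #include <stdint.h>

      struct ancestor_list;     /* shared, immutable list of ancestor labels */

      struct thread_label {
          uint32_t depth;                    /* nesting depth of this thread  */
          uint64_t id_at_depth;              /* identifier within this level  */
          struct ancestor_list *ancestors;   /* inherited pointer, not a copy */
      };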

  19. Advanced parallel programming models research and development opportunities.

    SciTech Connect

    Wen, Zhaofang.; Brightwell, Ronald Brian

    2004-07-01

    There is currently a large research and development effort within the high-performance computing community on advanced parallel programming models. This research can potentially have an impact on parallel applications, system software, and computing architectures in the next several years. Given Sandia's expertise and unique perspective in these areas, particularly on very large-scale systems, there are many areas in which Sandia can contribute to this effort. This technical report provides a survey of past and present parallel programming model research projects and provides a detailed description of the Partitioned Global Address Space (PGAS) programming model. The PGAS model may offer several improvements over the traditional distributed memory message passing model, which is the dominant model currently being used at Sandia. This technical report discusses these potential benefits and outlines specific areas where Sandia's expertise could contribute to current research activities. In particular, we describe several projects in the areas of high-performance networking, operating systems and parallel runtime systems, compilers, application development, and performance evaluation.

  20. Testing New Programming Paradigms with NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

    2000-01-01

    Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also in the increasing complexity of real applications. Technologies have been developed that aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. In recent years, new efforts have been made to define new parallel programming paradigms. The best examples are HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplifying the tedious tasks encountered in writing message passing programs. HPF is independent of the memory hierarchy; however, due to the immaturity of compiler technology, its performance is still questionable. Although the use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promises portability in a heterogeneous environment and offers the possibility to "compile once and run anywhere." To test these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage...

  2. Performance Evaluation Methodologies and Tools for Massively Parallel Programs

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Sarukkai, Sekhar; Tucker, Deanne (Technical Monitor)

    1994-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. The recent introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance tool technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g., AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.

  4. Methodologies and Tools for Tuning Parallel Programs: Facts and Fantasies

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Lum, Henry, Jr. (Technical Monitor)

    1994-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. The recent introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance tool technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g. AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.

  6. Final Report: Center for Programming Models for Scalable Parallel Computing

    SciTech Connect

    Mellor-Crummey, John

    2011-09-13

    As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.

  7. On the utility of threads for data parallel programming

    NASA Technical Reports Server (NTRS)

    Fahringer, Thomas; Haines, Matthew; Mehrotra, Piyush

    1995-01-01

    Threads provide a useful programming model for asynchronous behavior because of their ability to encapsulate units of work that can then be scheduled for execution at runtime, based on the dynamic state of a system. Recently, the threaded model has been applied to the domain of data parallel scientific codes, and initial reports indicate that the threaded model can produce performance gains over non-threaded approaches, primarily by overlapping useful computation with communication latency. However, overlapping computation with communication is possible without the benefit of threads if the communication system supports asynchronous primitives, and this comparison has not been made in previous papers. This paper provides a critical look at the utility of lightweight threads as applied to data parallel scientific programming.
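
    As a concrete illustration of the non-threaded alternative the authors examine, the sketch below overlaps a halo exchange with independent computation using MPI's asynchronous primitives. It is a minimal sketch only: the buffer sizes and the two compute helpers are hypothetical stand-ins for real application code.

      #include <mpi.h>

      /* Hypothetical stand-ins for the application's real work. */
      static void compute_interior(double *u, int n) { for (int i = 1; i < n - 1; i++) u[i] *= 0.5; }
      static void compute_boundary(double *u, const double *halo) { u[0] += halo[0]; }

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &np);
          int left = (rank + np - 1) % np, right = (rank + 1) % np;
          double u[64] = {1.0}, halo[1];
          MPI_Request reqs[2];
          /* Post the communication first ... */
          MPI_Irecv(halo, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
          MPI_Isend(&u[63], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
          /* ... overlap it with work that does not need the halo ... */
          compute_interior(u, 64);
          /* ... and block only when the boundary value is required. */
          MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
          compute_boundary(u, halo);
          MPI_Finalize();
          return 0;
      }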

  8. MELD: A Logical Approach to Distributed and Parallel Programming

    DTIC Science & Technology

    2012-03-01

    extremely successful. A recent success story is the MapReduce programming model, which can be viewed as a somewhat more generalized version of the data...parallel model that is optimized for large scale clusters. In MapReduce, the data sharing and scheduling model is very simple: the computation for...models than MapReduce, but they do not allow the programmer to specify scheduling strategies or support formal proof techniques. Hellerstein’s group

  9. VPC - A Proposal for a Vector Parallel C Programming Language.

    DTIC Science & Technology

    1987-10-30

    [8] B. Kernighan and D. Ritchie. The C Programming Language. Prentice-Hall, 1978. [9] B. Kernighan and R. Pike. The Unix Programming Environment...designed to be an extended version of the C language as defined by Kernighan and Ritchie (Ref. 8). Rather than taking the approach of extending...basis. Unix is a trademark of AT&T Bell Laboratories. The Vector Parallel C Language ...tion calls that activate the FX/8's proprietary

  10. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA Ames Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.
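
    The flavor of this approach can be suggested with a minimal sketch: coarse-grained workers created with fork() communicating through an anonymous shared mapping, so that "communication" is an ordinary memory store. This is illustrative only and does not reproduce the actual NAS MLP library interface.

      #define _DEFAULT_SOURCE
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      #include <unistd.h>

      int main(void)
      {
          int nproc = 4;
          /* Shared arena visible to all forked workers. */
          double *arena = mmap(NULL, nproc * sizeof(double),
                               PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          for (int p = 0; p < nproc; p++) {
              if (fork() == 0) {
                  arena[p] = p * p;     /* "communication" is a plain store */
                  _exit(0);
              }
          }
          while (wait(NULL) > 0)        /* join the coarse-grained workers */
              ;
          for (int p = 0; p < nproc; p++)
              printf("arena[%d] = %g\n", p, arena[p]);
          return 0;
      }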

  11. An informal introduction to program transformation and parallel processors

    SciTech Connect

    Hopkins, K.W.

    1994-08-01

    In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers, but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the ``mathematical layman.`` I place myself in that category as well, as my previous use of computers was as a classroom demonstration tool.

  12. A Comparison of Shared Memory Parallel Programming Models

    SciTech Connect

    Mogill, Jace A; Haglin, David J

    2010-05-24

    The dominant parallel programming models for shared memory computers, Pthreads and OpenMP, are both thread-centric in that they are based on explicit management of tasks and enforce data dependencies and output ordering through task management. By comparison, the Cray XMT programming model is data-centric where the primary concern of the programmer is managing data dependencies, allowing threads to progress in a data flow fashion. The XMT implements this programming model by associating tag bits with each word of memory, affording efficient fine grained data synchronization independent of the number of processors or how tasks are scheduled. When task management is implicit and synchronization is abundant, efficient, and easy to use, programmers have viable alternatives to traditional thread-centric algorithms. In this paper we compare the amount of available parallelism relative to the amount of work in a variety of different algorithms and data structures when synchronization does not need to be rationed, as well as identify opportunities for platform and performance portability of the data-centric programming model on multi-core processors.
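
    A rough software analogue of the tag-bit mechanism (assuming C11 atomics; the readfe/writeef names echo Cray terminology) can convey the idea. On the XMT this synchronization is done in hardware on every memory word; the spin loops below are purely an artifact of the emulation.

      #include <stdatomic.h>
      #include <stdio.h>

      /* Each cell pairs a value with a "full" tag; -1 marks a cell
         briefly locked while it changes state. */
      typedef struct { _Atomic int full; int value; } cell;

      static void writeef(cell *c, int v)   /* wait until empty, write, set full */
      {
          int expected = 0;
          while (!atomic_compare_exchange_weak(&c->full, &expected, -1))
              expected = 0;
          c->value = v;
          atomic_store(&c->full, 1);
      }

      static int readfe(cell *c)            /* wait until full, read, set empty */
      {
          int expected = 1;
          while (!atomic_compare_exchange_weak(&c->full, &expected, -1))
              expected = 1;
          int v = c->value;
          atomic_store(&c->full, 0);
          return v;
      }

      int main(void)
      {
          cell c = { 0, 0 };
          writeef(&c, 42);
          printf("%d\n", readfe(&c));
          return 0;
      }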

  13. Users manual for the Chameleon parallel programming tools

    SciTech Connect

    Gropp, W.; Smith, B.

    1993-06-01

    Message passing is a common method for writing programs for distributed-memory parallel computers. Unfortunately, the lack of a standard for message passing has hampered the construction of portable and efficient parallel programs. In an attempt to remedy this problem, a number of groups have developed their own message-passing systems, each with its own strengths and weaknesses. Chameleon is a second-generation system of this type. Rather than replacing these existing systems, Chameleon is meant to supplement them by providing a uniform way to access many of these systems. Chameleon's goals are to (a) be very lightweight (low overhead), (b) be highly portable, and (c) help standardize program startup and the use of emerging message-passing operations such as collective operations on subsets of processors. Chameleon also provides a way to port programs written using PICL or Intel NX message passing to other systems, including collections of workstations. Chameleon is tracking the Message-Passing Interface (MPI) draft standard and will provide both an MPI implementation and an MPI transport layer. Chameleon provides support for heterogeneous computing by using p4 and PVM. Chameleon's support for homogeneous computing includes the portable libraries p4, PICL, and PVM and vendor-specific implementations for Intel NX, IBM EUI (SP-1), and Thinking Machines CMMD (CM-5). Support for Ncube and PVM 3.x is also under development.
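
    The shape of such a uniform layer can be suggested with a hypothetical two-macro veneer over MPI (Chameleon's real interface and macro names differ): the application codes to the thin layer, and each underlying system supplies a trivial mapping.

      #include <mpi.h>

      /* Hypothetical portability veneer; one mapping per underlying system. */
      #define CH_SEND(buf, n, dst) \
          MPI_Send((buf), (n), MPI_BYTE, (dst), 0, MPI_COMM_WORLD)
      #define CH_RECV(buf, n, src) \
          MPI_Recv((buf), (n), MPI_BYTE, (src), 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE)

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          char msg[16] = "hello";
          if (rank == 0)
              CH_SEND(msg, 16, 1);
          else if (rank == 1)
              CH_RECV(msg, 16, 0);
          MPI_Finalize();
          return 0;
      }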

  14. Automated Performance Prediction of Message-Passing Parallel Programs

    NASA Technical Reports Server (NTRS)

    Block, Robert J.; Sarukkai, Sekhar; Mehra, Pankaj; Woodrow, Thomas S. (Technical Monitor)

    1995-01-01

    The increasing use of massively parallel supercomputers to solve large-scale scientific problems has generated a need for tools that can predict scalability trends of applications written for these machines. Much work has been done to create simple models that represent important characteristics of parallel programs, such as latency, network contention, and communication volume. But many of these methods still require substantial manual effort to represent an application in the model's format. The NIK toolkit described in this paper is the result of an ongoing effort to automate the formation of analytic expressions of program execution time, with a minimum of programmer assistance. In this paper we demonstrate the feasibility of our approach by extending previous work to detect and model communication patterns automatically, with and without overlapped computations. The predictions derived from these models agree, within reasonable limits, with execution times of programs measured on the Intel iPSC/860 and Paragon. Further, we demonstrate the use of NIK in selecting optimal computational grain size and studying various scalability metrics.

  15. NavP: Structured and Multithreaded Distributed Parallel Programming

    NASA Technical Reports Server (NTRS)

    Pan, Lei

    2007-01-01

    We present Navigational Programming (NavP) -- a distributed parallel programming methodology based on the principles of migrating computations and multithreading. The four major steps of NavP are: (1) Distribute the data using the data communication pattern in a given algorithm; (2) Insert navigational commands for the computation to migrate and follow large-sized distributed data; (3) Cut the sequential migrating thread and construct a mobile pipeline; and (4) Loop back for refinement. NavP is significantly different from the current prevailing Message Passing (MP) approach. The advantages of NavP include: (1) NavP is structured distributed programming and it does not change the code structure of an original algorithm. This is in sharp contrast to MP, as MP implementations in general do not resemble the original sequential code; (2) NavP implementations are always competitive with the best MPI implementations in terms of performance. In contrast, approaches such as DSM or HPF, even though they are relatively easy to use compared to MP, have so far failed to deliver satisfying performance; (3) NavP provides incremental parallelization, which is beyond the reach of MP; and (4) NavP is a unifying approach that allows us to exploit both fine- (multithreading on shared memory) and coarse- (pipelined tasks on distributed memory) grained parallelism. This is in contrast to the currently popular hybrid use of MP+OpenMP, which is known to be complex to use. We present experimental results that demonstrate the effectiveness of NavP.

  17. A scalable parallel algorithm for multiple objective linear programs

    NASA Technical Reports Server (NTRS)

    Wiecek, Malgorzata M.; Zhang, Hong

    1994-01-01

    This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating the efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives especially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.

  18. Parallelization and checkpointing of GPU applications through program transformation

    SciTech Connect

    Solano-Quinde, Lizandro Damian

    2012-01-01

    GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose GPU applications tractable has consolidated GPUs as an alternative for accelerating such applications. Among the areas that have benefited from GPU acceleration are signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running on multi-GPU systems. Furthermore, multi-GPU systems help to overcome the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached with libraries that distribute the workload at runtime; however, these impose execution overhead and are not portable. On traditional CPU systems, by contrast, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at the application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism in multi-GPU systems and to provide a suitable checkpoint/restart technique, both through application transformation.

  19. Start/Pat: A parallel-programming toolkit

    SciTech Connect

    Appelbe, B.; Smith, K. ); McDowell, C. )

    1989-07-01

    How can you make Fortran code parallel without isolating the programmer from learning to understand and exploit parallelism effectively? With an interactive toolkit that automates parallelization as it educates. This paper discusses the Start/Pat toolkit.

  20. A Programming Model Performance Study Using the NAS Parallel Benchmarks

    DOE PAGES

    Shan, Hongzhang; Blagojević, Filip; Min, Seung-Jai; ...

    2010-01-01

    Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS to understand their performance and memory usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other ways to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an Infiniband cluster. Our results show that in general the three programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to re-write the UPC versions and achieve equal performance to OpenMP. Using OpenMP was also the most advantageous in terms of memory usage. Also we compare performance differences between the two Cray systems, which have quad-core and hex-core processors. We show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.

  1. An empirical study of FORTRAN programs for parallelizing compilers

    NASA Technical Reports Server (NTRS)

    Shen, Zhiyu; Li, Zhiyuan; Yew, Pen-Chung

    1990-01-01

    Some results are reported from an empirical study of program characteristics that are important to parallelizing compiler writers, especially in the area of data dependence analysis and program transformations. The state of the art in data dependence analysis and some parallel execution techniques are examined. The major findings are included. Many subscripts contain symbolic terms with unknown values. A few methods of determining their values at compile time are evaluated. Array references with coupled subscripts appear quite frequently; these subscripts must be handled simultaneously in a dependence test, rather than being handled separately as in current test algorithms. Nonzero coefficients of loop indexes in most subscripts are found to be simple: they are either 1 or -1. This allows an exact real-valued test to be as accurate as an exact integer-valued test for one-dimensional or two-dimensional arrays. Dependencies with uncertain distance are found to be rather common, and one of the main reasons is the frequent appearance of symbolic terms with unknown values.
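
    The coupled-subscript point is easy to see in a small example of our own (C rather than FORTRAN): below, the index i links the two subscript positions, so a dimension-by-dimension test conservatively reports a possible dependence, while a test that solves both subscript equations simultaneously (i = i' and i' = i + 1) proves the iterations independent.

      /* Coupled subscripts: i appears in both subscript positions. */
      void update(double a[][100], int n)
      {
          for (int i = 0; i < n - 1; i++)
              a[i][i + 1] = a[i][i] + 1.0;  /* writes a[i][i+1], reads a[i][i] */
      }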

  3. Automated Performance Prediction of Message Passing Parallel Programs

    NASA Technical Reports Server (NTRS)

    Block, Robert J.; Mehra, Pankaj; Sarukkai, Sekhar; Woodrow, Thomas (Technical Monitor)

    1994-01-01

    As the trend toward massively parallel processing continues, the need for tools that can predict scalability trends becomes greater. While high level languages like HPF have come into greater use, explicit message-passing programs proliferate, and will probably do so for some time, thanks to the onslaught of standards such as MPI. Much work has been done to create simple models that represent important characteristics of parallel programs, such as latency, network contention, and communication volume. But many of these methods still require a substantial manual effort to represent an application in the model's format. The YAPP ("Yet Another Performance Predictor") tool is an attempt to automate the formation of first-order expressions for completion time, with a minimum of programmer assistance. The content of this paper is as follows: First, we explore the implementation details of YAPP, and illustrate with examples some of the reasons that automatic prediction is difficult. In the following sections, we present the results of four applications, using execution traces on the Intel i860, analyze the error in YAPP's predictions, explain the limitations of our implementation, and mention possible future additions. In particular, we illustrate techniques used to identify pipeline communication patterns, and demonstrate how compiler analysis and regression are combined to automate the prediction process.

  5. Center for Programming Models for Scalable Parallel Computing: Future Programming Models

    SciTech Connect

    Gao, Guang, R.

    2008-07-24

    The mission of the pmodel center project is to develop software technology to support scalable parallel programming models for terascale systems. The goal of the specific UD subproject is, in this context, to develop an efficient and robust methodology and tools for HPC programming. More specifically, the focus is on developing new programming models which facilitate programmers in porting their applications onto parallel high performance computing systems. During the course of the research in the past 5 years, the landscape of microprocessor chip architecture has witnessed a fundamental change – the emergence of multi-core/many-core chip architecture appears to have become the mainstream technology and will have a major impact on future generation parallel machines. The programming model for shared-address space machines is becoming critical to such multi-core architectures. Our research highlight is the in-depth study of proposed fine-grain parallelism/multithreading support on such future generation multi-core architectures. Our research has demonstrated the significant impact such a fine-grain multithreading model can have on the productivity of parallel programming models and their efficient implementation.

  6. A Parallel Vector Machine for the PM Programming Language

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2016-04-01

    PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. Inter-node communication is accomplished using
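
    A toy stand-in (not PM's actual instruction set) for masked vector execution: the condition of an if/else over a parallel loop is evaluated into a Boolean mask, and each branch becomes a vector operation applied only where its mask holds.

      #include <stdio.h>
      #define N 8

      int main(void)
      {
          double x[N] = {3, -1, 4, -1, 5, -9, 2, -6}, y[N];
          int mask[N];
          for (int i = 0; i < N; i++) mask[i] = x[i] > 0;         /* compute mask  */
          for (int i = 0; i < N; i++) if (mask[i])  y[i] = x[i];  /* "then" part */
          for (int i = 0; i < N; i++) if (!mask[i]) y[i] = -x[i]; /* "else" part */
          for (int i = 0; i < N; i++) printf("%g ", y[i]);
          printf("\n");
          return 0;
      }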

  7. Programming a massively parallel, computation universal system: Static behavior

    NASA Astrophysics Data System (ADS)

    Lapedes, Alan; Farber, Robert

    1986-08-01

    Massively parallel systems are presently the focus of intense interest for a variety of reasons. A key problem is how to control, or ``program'' these systems. In previous work by the authors, the ``optimum finding'' properties of Hopfield neural nets were applied to the nets themselves to create a ``neural compiler.'' This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). The procedure is effective and efficient. In this series of papers we extend that approach to programming nets that contain interneurons (sometimes called ``hidden neurons''), and thus we deal with nets capable of universal computation. Our work is closely related to recent work of Rumelhart et al. (also Parker, and LeCun), which may be viewed as a special case of this formalism and therefore of ``computing with attractors.'' In later papers in this series, we present the theory for programming time dependent behavior, and consider practical implementations. One may expect numerous applications in view of the computation universality of these networks.

  8. Parallel functional programming in Sisal: Fictions, facts, and future

    SciTech Connect

    McGraw, J.R.

    1993-07-01

    This paper provides a status report on the progress of research and development on the functional language Sisal. This project focuses on providing a highly effective method of writing large scientific applications that can efficiently execute on a spectrum of different multiprocessors. The paper includes sections on the language definition, compilation strategies, and programming techniques intended for readers with little or no background with Sisal. The section on performance presents our most recent results on execution speed for shared-memory multiprocessors, our findings using Sisal to develop codes, and our experiences migrating the same source code to different machines. For large programs, the execution performance of Sisal (with minimal supporting advice from the programmer) usually exceeds that of the best available automatic, vector/parallel Fortran compilers. Our evidence also indicates that Sisal programs tend to be shorter in length, faster to write, and clearer to understand than equivalent algorithms in Fortran. The paper concludes with a substantial discussion of common criticisms of the language and our plans for addressing them. Most notably, efficient implementations for distributed memory machines are lacking, an issue we plan to remedy.

  9. Digital signal processor and programming system for parallel signal processing

    SciTech Connect

    Van den Bout, D.E.

    1987-01-01

    This thesis describes an integrated assault upon the problem of designing high-throughput, low-cost digital signal-processing systems. The dual prongs of this assault consist of: (1) the design of a digital signal processor (DSP) which efficiently executes signal-processing algorithms in either a uniprocessor or multiprocessor configuration, (2) the PaLS programming system which accepts an arbitrary algorithm, partitions it across a group of DSPs, synthesizes an optimal communication link topology for the DSPs, and schedules the partitioned algorithm upon the DSPs. The results of applying a new quasi-dynamic analysis technique to a set of high-level signal-processing algorithms were used to determine the uniprocessor features of the DSP design. For multiprocessing applications, the DSP contains an interprocessor communications port (IPC) which supports simple, flexible, dataflow communications while allowing the total communication bandwidth to be incrementally allocated to achieve the best link utilization. The net result is a DSP with a simple architecture that is easy to program for both uniprocessor and multi-processor modes of operation. The PaLS programming system simplifies the task of parallelizing an algorithm for execution upon a multiprocessor built with the DSP.

  10. Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Sarukkai, Sekhar R.; Mehra, Pankaj; Lum, Henry, Jr. (Technical Monitor)

    1994-01-01

    This paper presents a methodology for debugging the performance of message-passing programs on both tightly coupled and loosely coupled distributed-memory machines. The AIMS (Automated Instrumentation and Monitoring System) toolkit, a suite of software tools for measurement and analysis of performance, is introduced and its application illustrated using several benchmark programs drawn from the field of computational fluid dynamics. AIMS includes (i) Xinstrument, a powerful source-code instrumentor, which supports both Fortran77 and C as well as a number of different message-passing libraries including Intel's NX, Thinking Machines' CMMD, and PVM; (ii) Monitor, a library of timestamping and trace-collection routines that run on supercomputers (such as Intel's iPSC/860, Delta, and Paragon and Thinking Machines' CM5) as well as on networks of workstations (including Convex Cluster and SparcStations connected by a LAN); (iii) Visualization Kernel, a trace-animation facility that supports source-code clickback, simultaneous visualization of computation and communication patterns, as well as analysis of data movements; (iv) Statistics Kernel, an advanced profiling facility that associates a variety of performance data with various syntactic components of a parallel program; (v) Index Kernel, a diagnostic tool that helps pinpoint performance bottlenecks through the use of abstract indices; (vi) Modeling Kernel, a facility for automated modeling of message-passing programs that supports both simulation-based and analytical approaches to performance prediction and scalability analysis; (vii) Intrusion Compensator, a utility for recovering true performance from observed performance by removing the overheads of monitoring and their effects on the communication pattern of the program; and (viii) Compatibility Tools, which convert AIMS-generated traces into formats used by other performance-visualization tools, such as ParaGraph, Pablo, and certain AVS/Explorer modules.

  12. Exploiting variability for energy optimization of parallel programs

    SciTech Connect

    Lavrijsen, Wim; Iancu, Costin; de Jong, Wibe; Chen, Xin; Schwan, Karsten

    2016-04-18

    In this paper we present optimizations that use DVFS mechanisms to reduce the total energy usage in scientific applications. Our main insight is that noise is intrinsic to large scale parallel executions and it appears whenever shared resources are contended. The presence of noise allows us to identify and manipulate any program regions amenable to DVFS. When compared to previous energy optimizations that make per core decisions using predictions of the running time, our scheme uses a qualitative approach to recognize the signature of executions amenable to DVFS. By recognizing the "shape of variability" we can optimize codes with highly dynamic behavior, which pose challenges to all existing DVFS techniques. We validate our approach using offline and online analyses for one-sided and two-sided communication paradigms. We have applied our methods to NWChem, and we show best case improvements in energy use of 12% at no loss in performance when using online optimizations running on 720 Haswell cores with one-sided communication. With NWChem on MPI two-sided and offline analysis, capturing the initialization, we find energy savings of up to 20%, with less than 1% performance cost.

  13. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems.

    PubMed

    Stone, John E; Gohara, David; Shi, Guochun

    2010-05-01

    We provide an overview of the key architectural features of recent microprocessor designs and describe the programming model and abstractions provided by OpenCL, a new parallel programming standard targeting these architectures.
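
    The abstractions the standard provides are visible even in a minimal OpenCL 1.x host program (error checking omitted for brevity): platform and device discovery, a context and command queue, runtime compilation of kernel source, device buffers, and an NDRange kernel launch.

      #include <stdio.h>
      #include <CL/cl.h>

      static const char *src =
          "__kernel void scale(__global float *v) {"
          "    v[get_global_id(0)] *= 2.0f;"
          "}";

      int main(void)
      {
          cl_platform_id plat;  cl_device_id dev;
          clGetPlatformIDs(1, &plat, NULL);
          clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
          cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
          cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
          cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
          clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
          cl_kernel k = clCreateKernel(prog, "scale", NULL);

          float host[4] = {1, 2, 3, 4};
          cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                      sizeof host, host, NULL);
          clSetKernelArg(k, 0, sizeof buf, &buf);
          size_t n = 4;
          clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
          clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof host, host, 0, NULL, NULL);
          for (int i = 0; i < 4; i++) printf("%g ", host[i]);
          printf("\n");
          return 0;
      }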

  14. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems

    PubMed Central

    Stone, John E.; Gohara, David; Shi, Guochun

    2010-01-01

    We provide an overview of the key architectural features of recent microprocessor designs and describe the programming model and abstractions provided by OpenCL, a new parallel programming standard targeting these architectures. PMID:21037981

  15. A parallel dynamic programming algorithm for multi-reservoir system optimization

    NASA Astrophysics Data System (ADS)

    Li, Xiang; Wei, Jiahua; Li, Tiejian; Wang, Guangqian; Yeh, William W.-G.

    2014-05-01

    This paper develops a parallel dynamic programming algorithm to optimize the joint operation of a multi-reservoir system. First, a multi-dimensional dynamic programming (DP) model is formulated for a multi-reservoir system. Second, the DP algorithm is parallelized using a peer-to-peer parallel paradigm. The parallelization is based on the distributed memory architecture and the message passing interface (MPI) protocol. We consider both the distributed computing and distributed computer memory in the parallelization. The parallel paradigm aims at reducing the computation time as well as alleviating the computer memory requirement associated with running a multi-dimensional DP model. Next, we test the parallel DP algorithm on the classic, benchmark four-reservoir problem on a high-performance computing (HPC) system with up to 350 cores. Results indicate that the parallel DP algorithm exhibits good performance in parallel efficiency; the parallel DP algorithm is scalable and will not be restricted by the number of cores. Finally, the parallel DP algorithm is applied to a real-world, five-reservoir system in China. The results demonstrate the parallel efficiency and practical utility of the proposed methodology.
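
    The peer-to-peer pattern is easy to sketch for a single reservoir (the paper's model is multi-dimensional, and all names and sizes below are toy stand-ins): each rank performs the stage optimization for its slice of the discretized state space, and an allgather at every stage gives all ranks the full value table for the next step of the backward recursion.

      #include <mpi.h>
      #include <float.h>
      #include <stdio.h>

      #define S 64   /* discretized storage states (toy size) */
      #define T 12   /* stages */

      /* Hypothetical one-stage return for moving from state s to s2. */
      static double reward(int s, int s2) { return -(double)((s - s2) * (s - s2)); }

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &np);
          int chunk = S / np;                 /* assume np divides S */
          double V[S] = {0};
          for (int t = T - 1; t >= 0; t--) {
              double mine[chunk];
              for (int i = 0; i < chunk; i++) {        /* my slice of states */
                  int s = rank * chunk + i;
                  double best = -DBL_MAX;
                  for (int s2 = 0; s2 < S; s2++) {
                      double v = reward(s, s2) + V[s2];
                      if (v > best) best = v;
                  }
                  mine[i] = best;
              }
              /* peer-to-peer exchange: every rank gets the whole table */
              MPI_Allgather(mine, chunk, MPI_DOUBLE, V, chunk, MPI_DOUBLE,
                            MPI_COMM_WORLD);
          }
          if (rank == 0) printf("V[0] = %g\n", V[0]);
          MPI_Finalize();
          return 0;
      }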

  16. Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

    NASA Technical Reports Server (NTRS)

    Quealy, Angela; Cole, Gary L.; Blech, Richard A.

    1993-01-01

    The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.

  17. Parallel logic programming and parallel systems software and hardware. Progress report (Final), 1 April 1988-31 March 1989

    SciTech Connect

    Minker, J.

    1989-07-29

    This progress report summarizes work performed under AFOSR-88-0152 on parallel logic programming, problem solving, and deductive data bases. A parallel problem-solving system, PRISM (Parallel Inference System), that was implemented on McMOB was ported to the BBN Butterfly machine. Two versions of PRISM were developed and are operational on the Butterfly: a message-passing ring-structure system and a shared-memory system. Experimental testing of PRISM on McMOB continued, while experiments were also conducted on the Butterfly systems. Enhancements made and completed during the grant period include a capability to handle negated queries and a capability to assert and retract statements. In addition to the above, work continued in the area of informative answers to queries in deductive data bases. A thesis was completed on the subject. An interpreter was developed, and is running, that can take restricted natural language as input and can respond with cooperative natural language output. In the area of parallel software development, the following were accomplished. Theoretical work on slicing/splicing was completed. Tools were provided for software development using artificial-intelligence techniques. AI software for massively parallel architectures was started.

  18. Retargeting of existing FORTRAN program and development of parallel compilers

    NASA Technical Reports Server (NTRS)

    Agrawal, Dharma P.

    1988-01-01

    The software models used in implementing the parallelizing compiler for the B-HIVE multiprocessor system are described. The various models and strategies used in the compiler development are: the flexible granularity model, which allows a compromise between two extreme granularity models; the communication model, which is capable of precisely describing interprocessor communication timings and patterns; the loop type detection strategy, which identifies different types of loops; the critical path with coloring scheme, which is a versatile scheduling strategy for any multicomputer with some associated communication costs; and the loop allocation strategy, which realizes optimum overlapped operations between computation and communication of the system. Using these models, several sample routines of the AIR3D package are examined and tested. It may be noted that automatically generated codes are highly parallelized to provide the maximum degree of parallelism, obtaining speedup on systems of up to 28 to 32 processors. A comparison of parallel codes for both the existing and proposed communication models is performed and the corresponding expected speedup factors are obtained. The experimentation shows that the B-HIVE compiler produces more efficient codes than existing techniques. Work is progressing well in completing the final phase of the compiler. Numerous enhancements are needed to improve the capabilities of the parallelizing compiler.

  19. Parallel implementation of a Monte Carlo molecular simulation program

    PubMed

    Carvalho; Gomes; Cordeiro

    2000-05-01

    Molecular simulation methods such as molecular dynamics and Monte Carlo are fundamental for the theoretical calculation of macroscopic and microscopic properties of chemical and biochemical systems. These methods often rely on heavy computations, and one sometimes feels the need to run them in powerful massively parallel machines. For moderate problem sizes, however, a not so powerful and less expensive solution based on a network of workstations may be quite satisfactory. In the present work, the strategy adopted in the development of a parallel version of a molecular simulation code, using the message passing model, to be used in a network of workstations is outlined. This parallel code is the adaptation of an older sequential code using the Metropolis Monte Carlo method. In this case, the message passing interface was used as the interprocess communications library, although the code could be easily adapted for other message passing systems such as the parallel virtual machine. For simple systems it is shown that speedups of 2 can be achieved for four processes with this cheap solution. For bigger and more complex simulated systems, even better speedups might be obtained, which indicates that the presented approach is appropriate for the efficient use of a network of workstations in parallel processing.
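
    The structure of such a parallelization can be conveyed with a toy one-dimensional Metropolis sampler (the paper's code samples molecular configurations, but the communication pattern is similar): each rank runs an independently seeded chain and only the accumulated averages are combined at the end.

      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <math.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &np);
          srand(1234 + rank);                 /* decorrelate the chains */
          double x = 0.0, sum = 0.0;
          long steps = 100000;
          for (long i = 0; i < steps; i++) {
              double xp = x + (rand() / (double)RAND_MAX - 0.5);
              /* Metropolis ratio for the target density exp(-x*x) */
              if (rand() / (double)RAND_MAX < exp(x * x - xp * xp))
                  x = xp;
              sum += x * x;
          }
          double total;
          MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0)
              printf("<x^2> = %g\n", total / ((double)steps * np));
          MPI_Finalize();
          return 0;
      }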

  20. Accelerate Performance on the Parallel Programming Super Highway

    DTIC Science & Technology

    2010-04-01

    WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) National Instruments, 11500 N Mopac Expwy, Austin, TX, 78759-3504 8. PERFORMING...standards) National Instruments LabVIEW, Agilent VEE, Northwoods Software Sanscript, many others… Market is demanding smaller, cheaper, faster targets...need for multiple development tools... Increasingly parallel embedded hardware warrants new methods of parallel software development: Dataflow

  1. A C++ Thread Package for Concurrent and Parallel Programming

    SciTech Connect

    Jie Chen; William Watson

    1999-11-01

    Recently, thread libraries have become a common entity on various operating systems such as Unix, Windows NT and VxWorks. These thread libraries offer significant performance enhancement by allowing applications to use multiple threads running either concurrently or in parallel on multiprocessors. However, the incompatibilities between native libraries introduce challenges for those who wish to develop portable applications.

  2. PRAM C: a new programming environment for fine-grain and coarse-grain parallelism.

    SciTech Connect

    Brown, Jonathan Leighton; Wen, Zhaofang.

    2004-11-01

    In the search for ''good'' parallel programming environments for Sandia's current and future parallel architectures, they revisit a long-standing open question. Can the PRAM parallel algorithms designed by theoretical computer scientists over the last two decades be implemented efficiently? This open question has co-existed with ongoing efforts in the HPC community to develop practical parallel programming models that can simultaneously provide ease of use, expressiveness, performance, and scalability. Unfortunately, no single model has met all these competing requirements. Here they propose a parallel programming environment, PRAM C, to bridge the gap between theory and practice. This is an attempt to provide an affirmative answer to the PRAM question, and to satisfy these competing practical requirements. This environment consists of a new thin runtime layer and an ANSI C extension. The C extension has two control constructs and one additional data type concept, ''shared''. This C extension should enable easy translation from PRAM algorithms to real parallel programs, much like the translation from sequential algorithms to C programs. The thin runtime layer bundles fine-grained communication requests into coarse-grained communication to be served by message-passing. Although the PRAM represents SIMD-style fine-grained parallelism, a stand-alone PRAM C environment can support both fine-grained and coarse-grained parallel programming in either a MIMD or SPMD style, interoperate with existing MPI libraries, and use existing hardware. The PRAM C model can also be integrated easily with existing models. Unlike related efforts proposing innovative hardware with the goal to realize the PRAM, ours can be a pure software solution with the purpose to provide a practical programming environment for existing parallel machines; it also has the potential to perform well on future parallel architectures.

  3. Programming environment for parallel-vision algorithms. Final technical report, February 1988-December 1989

    SciTech Connect

    Brown, C.

    1990-04-11

    This contract developed and disseminated papers, ideas, algorithms, analysis, software, applications, and implementations for parallel programming environments for computer vision and for vision applications. The work has been widely reported and highly influential. The most significant work centered on the Butterfly Parallel Processor, the MaxVideo pipelined parallel image processor, and the development of the real-time computer vision laboratory. For the Butterfly, the Psyche multi-model operating system was developed and the CONSUL autoparallelizing compiler was designed. Much basic and influential performance monitoring and debugging work was completed, resulting in working systems and novel algorithms. There was also significant research in systems and applications using other parallel architectures in the laboratory, such as the MaxVideo parallel pipelined image processor. The contract developed a heterogeneous parallel architecture involving pipelined and MIMD parallelism and integrated it with a robot head.

  4. Object-Oriented NeuroSys: Parallel Programs for Simulating Large Networks of Biologically Accurate Neurons

    SciTech Connect

    Pacheco, P; Miller, P; Kim, J; Leese, T; Zabiyaka, Y

    2003-05-07

    Object-oriented NeuroSys (ooNeuroSys) is a collection of programs for simulating very large networks of biologically accurate neurons on distributed memory parallel computers. It includes two principal programs: ooNeuroSys, a parallel program for solving the large systems of ordinary differential equations arising from the interconnected neurons, and Neurondiz, a parallel program for visualizing the results of ooNeuroSys. Both programs are designed to be run on clusters and use the MPI library to obtain parallelism. ooNeuroSys also includes an easy-to-use Python interface. This interface allows neuroscientists to quickly develop and test complex neuron models. Both ooNeuroSys and Neurondiz have a design that allows for both high performance and relative ease of maintenance.

  5. Communications oriented programming of parallel iterative solutions of sparse linear systems

    NASA Technical Reports Server (NTRS)

    Patrick, M. L.; Pratt, T. W.

    1986-01-01

    Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.

  6. Adapting high-level language programs for parallel processing using data flow

    NASA Technical Reports Server (NTRS)

    Standley, Hilda M.

    1988-01-01

    EASY-FLOW, a very high-level data flow language, is introduced for the purpose of adapting programs written in a conventional high-level language to a parallel environment. The level of parallelism provided is of the large-grained variety in which parallel activities take place between subprograms or processes. A program written in EASY-FLOW is a set of subprogram calls as units, structured by iteration, branching, and distribution constructs. A data flow graph may be deduced from an EASY-FLOW program.

  7. CRBLASTER: a fast parallel-processing program for cosmic ray rejection

    NASA Astrophysics Data System (ADS)

    Mighell, Kenneth J.

    2008-08-01

    Many astronomical image-analysis programs are based on algorithms that can be described as being embarrassingly parallel, where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. A major reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called crblaster which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. crblaster is written in C using the industry standard Message Passing Interface (MPI) library. Processing a single 800×800 HST WFPC2 image takes 1.87 seconds using 4 processes on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 processors is 82%. The code can be used as a software framework for easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms; the biggest required modification is the replacement of the core image processing function with an alternative image-analysis function based on a single-processor algorithm. I describe the design, implementation and performance of the program.
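
    The embarrassingly parallel skeleton crblaster exemplifies looks roughly as follows (a sketch with MPI; filter_block() is a hypothetical stand-in for the L.A.Cosmic step, and a real implementation would scatter overlapping row blocks so each worker sees its neighbors' boundary pixels):

      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define W 800
      #define H 800

      /* Hypothetical stand-in for the per-block image-analysis step. */
      static void filter_block(float *rows, int nrows)
      {
          for (long i = 0; i < (long)nrows * W; i++)
              if (rows[i] > 1000.0f) rows[i] = 0.0f;   /* toy "cosmic ray" clip */
      }

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &np);
          int myrows = H / np;                         /* assume np divides H */
          float *img = NULL;
          float *block = malloc((size_t)myrows * W * sizeof *block);
          if (rank == 0) img = calloc((size_t)W * H, sizeof *img);
          MPI_Scatter(img, myrows * W, MPI_FLOAT, block, myrows * W, MPI_FLOAT,
                      0, MPI_COMM_WORLD);
          filter_block(block, myrows);                 /* no neighbor traffic */
          MPI_Gather(block, myrows * W, MPI_FLOAT, img, myrows * W, MPI_FLOAT,
                     0, MPI_COMM_WORLD);
          if (rank == 0) printf("done\n");
          MPI_Finalize();
          return 0;
      }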

  8. Buffered coscheduling for parallel programming and enhanced fault tolerance

    DOEpatents

    Petrini, Fabrizio; Feng, Wu-chun

    2006-01-31

    A computer-implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.
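
    The control-information exchange at a strobe can be sketched as a collective in which every process tells every other how many messages it queued for it during the preceding time slice (a much-simplified, hypothetical shape of the patented scheme):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &np);
          int outgoing[np], incoming[np];
          for (int d = 0; d < np; d++) outgoing[d] = 0;
          outgoing[(rank + 1) % np] = 1;   /* queued during the time slice */
          /* The strobe: global exchange of control information. */
          MPI_Alltoall(outgoing, 1, MPI_INT, incoming, 1, MPI_INT,
                       MPI_COMM_WORLD);
          int total = 0;
          for (int d = 0; d < np; d++) total += incoming[d];
          printf("rank %d expects %d message(s) next slice\n", rank, total);
          MPI_Finalize();
          return 0;
      }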

  9. A data-driven parallel execution model and architecture for logic programs

    SciTech Connect

    Tseng, Chien-Chao.

    1989-01-01

    Logic Programming has come to prominence in recent years after the decision of the Japanese Fifth Generation Project to adopt it as the kernel language. A significant number of research projects are attempting to implement different schemes to exploit the inherent parallelism in logic programs. The dataflow architectural model has been found to be attractive for parallel execution of logic programs. In this research, five dataflow execution models available in the literature have been critically reviewed. The primary aim of the critical review was to establish a set of design issues critical to efficient execution. Based on the established design issues, the abstract data-driven machine model, named LogDf, is developed for parallel execution of logic programs. The execution scheme supports OR-parallelism, Restricted AND parallelism and stream parallelism. Multiple binding environments are represented using a stream of streams structure (S-stream). Eager evaluation is performed by passing binding environments between subgoal literals as S-streams, which are formed using non-strict constructors. The hierarchical multi-level stream structure provides a logical framework for distributing the streams to enhance parallelism in production/consumption as well as control of parallelism. The scheme for compiling the dataflow graphs, developed in this thesis, eliminates the necessity of any operand matching unit in the underlying dynamic dataflow architecture. In this thesis, an architecture for the abstract machine LogDf is also provided and the performance evaluation of this model is based on this architecture.

  10. Computer simulation program for parallel SITAN. [Sandia Inertia Terrain-Aided Navigation, in FORTRAN

    SciTech Connect

    Andreas, R.D.; Sheives, T.C.

    1980-11-01

    This computer program simulates the operation of parallel SITAN using digitized terrain data. An actual trajectory is modeled including the effects of inertial navigation errors and radar altimeter measurements.

  11. F-Nets and Software Cabling: Deriving a Formal Model and Language for Portable Parallel Programming

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    Parallel programming is still based upon antiquated sequence-based definitions of the terms "algorithm" and "computation", resulting in programs which are architecture-dependent and difficult to design and analyze. By focusing on obstacles inherent in existing practice, a more portable model is derived here, which is then formalized into a model called F-Nets, which utilizes a combination of imperative and functional styles. This formalization suggests more general notions of algorithm and computation, as well as insights into the meaning of structured programming in a parallel setting. To illustrate how these principles can be applied, a very-high-level graphical architecture-independent parallel language, called Software Cabling, is described, with many of the features normally expected from today's computer languages (e.g., data abstraction, data parallelism, and object-based programming constructs).

  12. A comparison using APPL and PVM for a parallel implementation of an unstructured grid generation program

    NASA Technical Reports Server (NTRS)

    Arthur, Trey; Bockelie, Michael J.

    1993-01-01

    Efforts to parallelize the VGRIDSG unstructured surface grid generation program are described. The inherent parallel nature of the grid generation algorithm used in VGRIDSG was exploited on a cluster of Silicon Graphics IRIS 4D workstations using the message-passing libraries Application Portable Parallel Library (APPL) and Parallel Virtual Machine (PVM). Comparisons of speedup are presented for generating the surface grid of a unit cube and a Mach 3.0 High Speed Civil Transport. It was concluded that for this application both APPL and PVM give approximately the same performance; however, APPL is easier to use.

  13. Backtracking and Re-execution in the Automatic Debugging of Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Matthews, Gregory; Hood, Robert; Johnson, Stephen; Leggett, Peter; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this work we describe a new approach using relative debugging to find differences in computation between a serial program and a parallel version of that program. We use a combination of re-execution and backtracking in order to find the first difference in computation that may ultimately lead to an incorrect value that the user has indicated. In our prototype implementation we use static analysis information from a parallelization tool in order to perform the backtracking as well as the mapping required between serial and parallel computations.

  14. Comparative Study of Message Passing and Shared Memory Parallel Programming Models in Neural Network Training

    SciTech Connect

    Vitela, J.; Gordillo, J.; Cortina, L.; Hanebutte, U.

    1999-12-14

    We present a comparative performance study of a coarse-grained parallel neural-network training code, implemented in both OpenMP and MPI, the standards for shared-memory and message-passing parallel programming environments, respectively. In addition, these versions of the parallel training code are compared to an implementation utilizing SHMEM, the native SGI/Cray environment for shared-memory programming. The multiprocessor platform used is an SGI/Cray Origin 2000 with up to 32 processors. It is shown that in this study the native Cray environment outperforms MPI for the entire range of processors used, while OpenMP shows better performance than the other two environments when using more than 19 processors. In this study, the efficiency is always greater than 60%, regardless of the parallel programming environment used and of the number of processors.

  15. Multiprocessor speed-up, Amdahl's Law, and the Activity Set Model of parallel program behavior

    NASA Technical Reports Server (NTRS)

    Gelenbe, Erol

    1988-01-01

    An important issue in the effective use of parallel processing is the estimation of the speed-up one may expect as a function of the number of processors used. Amdahl's Law has traditionally provided a guideline to this issue, although it appears excessively pessimistic in the light of recent experimental results. In this note, Amdahl's Law is amended by giving greater importance to the capacity of a program to make effective use of parallel processing, while also recognizing the fact that imbalance in the workload of each processor is bound to occur. An activity set model of parallel program behavior is then introduced, along with the corresponding parallelism index of a program, leading to upper and lower bounds on the speed-up.
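
    For reference, the classical form of Amdahl's Law that this note amends bounds the speed-up S on N processors by the parallelizable fraction p of the work (the amended bound with the parallelism index and workload-imbalance terms is not reproduced here):

        S(N) = \frac{1}{(1 - p) + p/N},
        \qquad
        \lim_{N \to \infty} S(N) = \frac{1}{1 - p}.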

  16. Concurrent extensions to the FORTRAN language for parallel programming of computational fluid dynamics algorithms

    NASA Technical Reports Server (NTRS)

    Weeks, Cindy Lou

    1986-01-01

    Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.

  17. Dual and parallel postdoctoral training programs: implications for the osteopathic medical profession.

    PubMed

    Burkhart, Diane N; Lischka, Terri A

    2011-04-01

    Students in colleges of osteopathic medicine have several options when considering postdoctoral training programs. In addition to training programs approved solely by the American Osteopathic Association or accredited solely by the Accreditation Council for Graduate Medical Education (ACGME), students can pursue programs accredited by both organizations (ie, dually accredited programs) or osteopathic programs that occur side-by-side with ACGME programs (ie, parallel programs). In the present article, we report on the availability and growth of these 2 training options and describe their benefits and drawbacks for trainees and the osteopathic medical profession as a whole.

  18. Using CLIPS in the domain of knowledge-based massively parallel programming

    NASA Technical Reports Server (NTRS)

    Dvorak, Jiri J.

    1994-01-01

    The Program Development Environment (PDE) is a tool for massively parallel programming of distributed-memory architectures. Adopting a knowledge-based approach, the PDE eliminates the complexity introduced by parallel hardware with distributed memory and offers complete transparency with respect to parallelism exploitation. The knowledge-based part of the PDE is realized in CLIPS. Its principal task is to find an efficient parallel realization of the application specified by the user in a comfortable, abstract, domain-oriented formalism. A large collection of fine-grain parallel algorithmic skeletons, represented as COOL objects in a tree hierarchy, contains the algorithmic knowledge. A hybrid knowledge base with rule modules and procedural parts, encoding expertise about the application domain, parallel programming, software engineering, and parallel hardware, enables a high degree of automation in the software development process. In this paper, important aspects of the implementation of the PDE using CLIPS and COOL are shown, including the embedding of CLIPS within the C++-based parts of the PDE. The appropriateness of the chosen approach and of the CLIPS language for knowledge-based software engineering is discussed.

  19. Programming a massively parallel, computation universal system: static behavior

    SciTech Connect

    Lapedes, A.; Farber, R.

    1986-01-01

    In previous work by the authors, the ''optimum finding'' properties of Hopfield neural nets were applied to the nets themselves to create a ''neural compiler.'' This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). In this series of papers that approach is extended to programming nets that contain interneurons (sometimes called ''hidden neurons''), and thus deals with nets capable of universal computation. 22 refs.

  20. Parallel Goals of the Early Childhood Music Program.

    ERIC Educational Resources Information Center

    Cohen, Veronica Wolf

    Early childhood music programs should be based on two interacting goals: (1) to teach those skills most appropriate to a particular level and (2) to nurture musical creativity and self-expression. Early childhood is seen as the optimum time for acquiring certain musical skills, of which the ability to sing in tune is considered primary. The vocal…

  1. Concurrent Programming Using Actors: Exploiting Large-Scale Parallelism,

    DTIC Science & Technology

    1985-10-07

    [Scanned DTIC record; no machine-readable abstract survives. Recoverable details from the OCR residue: G. Agha et al., MIT Artificial Intelligence Laboratory, 545 Technology Square, Cambridge, MA.]

  2. Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs

    DTIC Science & Technology

    2008-01-01

    follows. In Sections 2 and 3, we present precise definitions for distributed programs, specifications, and fault-tolerance. We formally state the... Subsequently, experimental results and analysis are presented in Section 6. Related work is discussed in Section 7. Finally, we conclude in Section... infinite computation by stuttering at sl. On the other hand, if there exists a state sd such that there is no outgoing transition (or a self-loop

  3. Managing Parallelism and Resources in Scientific Dataflow Programs

    DTIC Science & Technology

    1990-03-01

    Fortunately, the programmer has the most intuition at that level and can offer guidance through simple annotations. At middle levels, where analysis is... hard and the programmer is likely to have little intuition, we can generally do well with rather crude analysis. Finally, in Chapter 9, we take a... recurrences into the innermost loop. This may be particularly important when a very high-level programming style is adopted and the programmer has not

  4. 78 FR 76628 - Pilot Program for Parallel Review of Medical Products; Extension of the Duration of the Program

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-12-18

    ...The Food and Drug Administration (FDA) and the Centers for Medicare and Medicaid Services (CMS) (the Agencies) are announcing the extension of the ``Pilot Program for Parallel Review of Medical Products.'' The Agencies have decided to continue the program as currently designed for an additional period of 2 years from the date of publication of this notice.

  5. Parallel programming of saccades during natural scene viewing: evidence from eye movement positions.

    PubMed

    Wu, Esther X W; Gilani, Syed Omer; van Boxtel, Jeroen J A; Amihai, Ido; Chua, Fook Kee; Yen, Shih-Cheng

    2013-10-24

    Previous studies have shown that saccade plans during natural scene viewing can be programmed in parallel. This evidence comes mainly from temporal indicators, i.e., fixation durations and latencies. In the current study, we asked whether eye movement positions recorded during scene viewing also reflect parallel programming of saccades. As participants viewed scenes in preparation for a memory task, their inspection of the scene was suddenly disrupted by a transition to another scene. We examined whether saccades after the transition were invariably directed immediately toward the center or were contingent on saccade onset times relative to the transition. The results, which showed a dissociation in eye movement behavior between two groups of saccades after the scene transition, supported the parallel programming account. Saccades with relatively long onset times (>100 ms) after the transition were directed immediately toward the center of the scene, probably to restart scene exploration. Saccades with short onset times (<100 ms) moved to the center only one saccade later. Our data on eye movement positions provide novel evidence of parallel programming of saccades during scene viewing. Additionally, results from the analyses of intersaccadic intervals were also consistent with the parallel programming hypothesis.

  6. Describing, using 'recognition cones'. [parallel-series model with English-like computer program

    NASA Technical Reports Server (NTRS)

    Uhr, L.

    1973-01-01

    A parallel-serial 'recognition cone' model is examined, taking into account the model's ability to describe scenes of objects. An actual program is presented in an English-like language. The concept of a 'description' is discussed together with possible types of descriptive information. Questions regarding the level and the variety of detail are considered along with approaches for improving the serial representations of parallel systems.

  7. Method for resource control in parallel environments using program organization and run-time support

    NASA Technical Reports Server (NTRS)

    Ekanadham, Kattamuri (Inventor); Moreira, Jose Eduardo (Inventor); Naik, Vijay Krishnarao (Inventor)

    2001-01-01

    A system and method for dynamic scheduling and allocation of resources to parallel applications during the course of their execution. By establishing well-defined interactions between an executing job and the parallel system, the system and method support dynamic reconfiguration of processor partitions, dynamic distribution and redistribution of data, communication among cooperating applications, and various other monitoring actions. The interactions occur only at specific points in the execution of the program where the aforementioned operations can be performed efficiently.

  8. Method for resource control in parallel environments using program organization and run-time support

    NASA Technical Reports Server (NTRS)

    Ekanadham, Kattamuri (Inventor); Moreira, Jose Eduardo (Inventor); Naik, Vijay Krishnarao (Inventor)

    1999-01-01

    A system and method for dynamic scheduling and allocation of resources to parallel applications during the course of their execution. By establishing well-defined interactions between an executing job and the parallel system, the system and method support dynamic reconfiguration of processor partitions, dynamic distribution and redistribution of data, communication among cooperating applications, and various other monitoring actions. The interactions occur only at specific points in the execution of the program where the aforementioned operations can be performed efficiently.

  9. CRBLASTER: A Fast Parallel-Processing Program for Cosmic Ray Rejection in Space-Based Observations

    NASA Astrophysics Data System (ADS)

    Mighell, K.

    Many astronomical image analysis tasks are based on algorithms that can be described as being embarrassingly parallel - where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. One reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called CRBLASTER which does cosmic-ray rejection using van Dokkum's L.A.Cosmic algorithm. CRBLASTER is written in C using the industry-standard Message Passing Interface library. Processing a single 800 x 800 Hubble Space Telescope Wide-Field Planetary Camera 2 (WFPC2) image takes 1.9 seconds using 4 processors on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 cores is 82%. The code has been designed to be used as a software framework for the easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms; all that needs to be done is to replace the core image-processing task (in this case the C function that performs the L.A.Cosmic algorithm) with an alternative image-analysis task based on a single-processor algorithm. I describe the design and implementation of the program and then discuss how it could be used for time-critical analysis applications, such as those involved with space surveillance, or for complex calibration tasks, such as the pipeline processing of images from large focal-plane arrays.

  10. Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA based implementations of selected CFD application benchmark kernels and compare them to corresponding message passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames, the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.
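
    For reference, the MPI-2 one-sided (RMA) style evaluated above can be sketched in C with a fence-synchronized MPI_Get (plain MPI-2, not the SMPlib interface; the ring-neighbor read is an arbitrary example):

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int local = rank;                /* value this rank exposes */
            MPI_Win win;
            MPI_Win_create(&local, (MPI_Aint)sizeof(int), sizeof(int),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);

            int got = -1;
            int peer = (rank + 1) % size;
            MPI_Win_fence(0, win);                               /* open epoch */
            MPI_Get(&got, 1, MPI_INT, peer, 0, 1, MPI_INT, win); /* remote read */
            MPI_Win_fence(0, win);                               /* close epoch */

            printf("rank %d read %d from rank %d\n", rank, got, peer);
            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }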

  11. Summer institute in parallel programming (Organized by Ewing Lusk and William Gropp)

    SciTech Connect

    Pieper, G.W.

    1992-01-01

    On September 3--13, 1991, Argonne National Laboratory hosted a Summer Institute in Parallel Programming. The institute was organized by the Mathematics and Computer Science Division and was supported in part by the National Science Foundation and by the US Department of Energy. The objective of the institute was to familiarize graduate students and postdoctoral researchers with new methods and tools for parallel programming and to provide hands-on experience with a diverse array of advanced-computer architectures. This report summarizes the activities that took place during the ten-day institute.

  12. Empirical valence bond models for reactive potential energy surfaces: A parallel multilevel genetic program approach

    NASA Astrophysics Data System (ADS)

    Bellucci, Michael A.; Coker, David F.

    2011-07-01

    We describe a new method for constructing empirical valence bond potential energy surfaces using a parallel multilevel genetic program (PMLGP). Genetic programs can be used to perform an efficient search through function space and parameter space to find the best functions and sets of parameters that fit energies obtained by ab initio electronic structure calculations. Building on the traditional genetic program approach, the PMLGP utilizes a hierarchy of genetic programming on two different levels. The lower level genetic programs are used to optimize coevolving populations in parallel while the higher level genetic program (HLGP) is used to optimize the genetic operator probabilities of the lower level genetic programs. The HLGP allows the algorithm to dynamically learn the mutation or combination of mutations that most effectively increase the fitness of the populations, causing a significant increase in the algorithm's accuracy and efficiency. The algorithm's accuracy and efficiency is tested against a standard parallel genetic program with a variety of one-dimensional test cases. Subsequently, the PMLGP is utilized to obtain an accurate empirical valence bond model for proton transfer in 3-hydroxy-gamma-pyrone in gas phase and protic solvent.

  13. Empirical valence bond models for reactive potential energy surfaces: a parallel multilevel genetic program approach.

    PubMed

    Bellucci, Michael A; Coker, David F

    2011-07-28

    We describe a new method for constructing empirical valence bond potential energy surfaces using a parallel multilevel genetic program (PMLGP). Genetic programs can be used to perform an efficient search through function space and parameter space to find the best functions and sets of parameters that fit energies obtained by ab initio electronic structure calculations. Building on the traditional genetic program approach, the PMLGP utilizes a hierarchy of genetic programming on two different levels. The lower level genetic programs are used to optimize coevolving populations in parallel while the higher level genetic program (HLGP) is used to optimize the genetic operator probabilities of the lower level genetic programs. The HLGP allows the algorithm to dynamically learn the mutation or combination of mutations that most effectively increase the fitness of the populations, causing a significant increase in the algorithm's accuracy and efficiency. The algorithm's accuracy and efficiency is tested against a standard parallel genetic program with a variety of one-dimensional test cases. Subsequently, the PMLGP is utilized to obtain an accurate empirical valence bond model for proton transfer in 3-hydroxy-gamma-pyrone in gas phase and protic solvent.

  14. High performance parallel computers for science: New developments at the Fermilab advanced computer program

    SciTech Connect

    Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.

    1988-08-01

    Fermilab's Advanced Computer Program (ACP) has been developing highly cost-effective, yet practical, parallel computers for high-energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors, each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor has been developed for theoretical physics. Each $4000 node is a FORTRAN- or C-programmable pipelined 20-MFlops (peak), 10-MByte single-board computer. These are plugged into a 16-port crossbar switch crate which handles both inter- and intra-crate communication. The crates are connected in a hypercube. Site-oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256-node, 5-GFlop system is under construction. 10 refs., 7 figs.

  15. Resolutions of the Coulomb operator: VIII. Parallel implementation using the modern programming language X10.

    PubMed

    Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P

    2014-10-30

    Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner, including the use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of the integral calculation using X10's work-stealing runtime, and report performance results for long-range HF energy calculations of a large molecule with a high-quality basis set running on up to 1024 cores of a high-performance cluster machine. Copyright © 2014 Wiley Periodicals, Inc.

  16. Architecture-Adaptive Computing Environment: A Tool for Teaching Parallel Programming

    NASA Technical Reports Server (NTRS)

    Dorband, John E.; Aburdene, Maurice F.

    2002-01-01

    Recently, networked and cluster computation have become very popular. This paper is an introduction to a new C based parallel language for architecture-adaptive programming, aCe C. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should be an ideal tool to teach parallel programming. In this paper, we will focus on some fundamental features of aCe C.

  17. 76 FR 62808 - Pilot Program for Parallel Review of Medical Products

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-10-11

    ... HUMAN SERVICES Centers for Medicare and Medicaid Services Food and Drug Administration Pilot Program for Parallel Review of Medical Products AGENCY: Food and Drug Administration, Centers for Medicare and Medicaid Services, HHS. ACTION: Notice. SUMMARY: The Food and Drug Administration (FDA) and the Centers for...

  18. 76 FR 66309 - Pilot Program for Parallel Review of Medical Products; Correction

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-10-26

    ... HUMAN SERVICES Centers for Medicare and Medicaid Services Food and Drug Administration Pilot Program for Parallel Review of Medical Products; Correction AGENCY: Food and Drug Administration, Centers for Medicare and Medicaid Services, HHS. ACTION: Notice; correction. SUMMARY: The Food and Drug Administration (FDA...

  19. Contrastive analysis of three parallel modes in multi-dimensional dynamic programming and its application in cascade reservoirs operation

    NASA Astrophysics Data System (ADS)

    Zhang, Yanke; Jiang, Zhiqiang; Ji, Changming; Sun, Ping

    2015-10-01

    The "curse of dimensionality" of dynamic programming (DP) has always been a great challenge to the cascade reservoirs operation optimization (CROO) because computer memory and computational time increase exponentially with the increasing number of reservoirs. It is an effective measure to combine DP with the parallel processing technology to improve the performance. This paper proposes three parallel modes for multi-dimensional dynamic programming (MDP) based on .NET4 Parallel Extensions, i.e., the stages parallel mode, state combinations parallel mode and hybrid parallel mode. A cascade reservoirs of Li Xiangjiang River in China is used as the study instance in this paper, and a detailed contrastive analysis of the three parallel modes on run-time, parallel acceleration ratio, parallel efficiency and memory usage has been implemented based on the parallel computing results. Results show that all the three parallel modes can effectively shorten the run-time so that to alleviate the "curse of dimensionality" of MDP, but relatively, the state combinations parallel mode is the optimal, the hybrid parallel is suboptimal and the stages parallel mode is poor.

  20. A library-based approach to portable, parallel, object-oriented programming: Interface, implementation, and application

    SciTech Connect

    Parkes, S.; Chandy, J.A.; Banerjee, P.

    1994-12-31

    The use of parallel platforms, despite increasing availability, remains largely restricted to well-structured, numeric applications. The authors address the issue of facilitating the use of parallel platforms on unstructured problems through object-oriented design techniques and the actor model of concurrent computation. They present a multi-level approach to expressing parallelism for unstructured applications: a high-level interface based on the actor model of concurrent object-oriented programming and a low-level interface which provides an object-oriented interface to system services across a wide range of parallel architectures. The high- and low-level interfaces are implemented as part of the ProperCAD II C++ class library which supports shared-memory, message-passing, and hybrid architectures. The authors demonstrate their approach through a detailed examination of the parallelization process for an existing unstructured serial application, a state-of-the-art VLSI computer-aided design application. They compare and contrast the library-based actor approach to other methods for expressing parallelism in C++ on a number of applications and kernels.

  1. Concurrent Collections (CnC): A new approach to parallel programming

    SciTech Connect

    2010-05-07

    A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer or program) who demands maximal flexibility for optimizing the mapping to a wide range of target platforms (parallel / serial, shared / distributed, homogeneous / heterogeneous, etc.) but needs no knowledge of the domain. Concurrent Collections (CnC) is based on this separation of concerns. The talk will present CnC and its benefits. About the speaker Kathleen Knobe has focused throughout her career on parallelism especially compiler technology, runtime system design and language design. She worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms for Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipment’s Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired and absorbed by Intel. She currently works in the Software and

  2. LDRD final report on massively-parallel linear programming : the parPCx system.

    SciTech Connect

    Parekh, Ojas; Phillips, Cynthia Ann; Boman, Erik Gunnar

    2005-02-01

    This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runs on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Integer and Combinatorial Optimizer). We

  3. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

    DOE PAGES

    Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; ...

    2016-06-01

    Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.

  4. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

    SciTech Connect

    Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; Sato, Mitsuhisa; Tang, William; Wang, Bei

    2016-06-01

    Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
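
    The global-view directive style described in these records can be sketched as follows (a schematic C fragment in XMP's directive style, not code from the GTC-P port; the exact directive syntax should be checked against the XcalableMP specification):

        #define N 1024

        #pragma xmp nodes p(4)
        #pragma xmp template t(0:N-1)
        #pragma xmp distribute t(block) onto p

        double field[N];
        #pragma xmp align field[i] with t(i)

        int main(void) {
            /* Global-view: the iterations are partitioned over the node set
               according to the block distribution of template t. */
        #pragma xmp loop on t(i)
            for (int i = 0; i < N; i++)
                field[i] = 0.0;
            return 0;
        }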

  6. Speedup properties of phases in the execution profile of distributed parallel programs

    SciTech Connect

    Carlson, B.M. (Computer Systems Research Institute); Wagner, T.D.; Dowdy, L.W. (Dept. of Computer Science); Worley, P.H.

    1992-08-01

    The execution profile of a distributed-memory parallel program specifies the number of busy processors as a function of time. Periods of homogeneous processor utilization are manifested in many execution profiles. These periods can usually be correlated with the algorithms implemented in the underlying parallel code. Three families of methods for smoothing execution profile data are presented. These approaches simplify the problem of detecting end points of periods of homogeneous utilization. These periods, called phases, are then examined in isolation, and their speedup characteristics are explored. A specific workload executed on an Intel iPSC/860 is used for validation of the techniques described.
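
    One simple smoothing method, sketched here as a centered moving average (an assumption for illustration; the report's three families of methods are not reproduced), makes the end points of homogeneous phases easier to detect:

        #include <stdio.h>

        /* Smooth an execution profile (busy-processor counts) with a
           centered moving average of half-width w. */
        static void smooth(const double *in, double *out, int n, int w) {
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                int cnt = 0;
                for (int j = i - w; j <= i + w; j++)
                    if (j >= 0 && j < n) { sum += in[j]; cnt++; }
                out[i] = sum / cnt;   /* average over the valid window */
            }
        }

        int main(void) {
            double profile[10] = {1, 8, 7, 8, 2, 2, 8, 7, 8, 1};
            double smoothed[10];
            smooth(profile, smoothed, 10, 1);
            for (int i = 0; i < 10; i++) printf("%.1f ", smoothed[i]);
            printf("\n");
            return 0;
        }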

  7. Exact Sciences' experience with the FDA and CMS parallel review program.

    PubMed

    Ridge, John R; Statz, Sandra

    2015-01-01

    Colorectal cancer (CRC) is the third most commonly diagnosed cancer and the second leading cause of cancer death among men and women combined in the USA. Although the benefits of early CRC detection are widely recognized, screening rates are suboptimal. Cologuard is a multitarget stool DNA screening test that offers a unique non-invasive option for CRC screening. Cologuard was the first product to be reviewed under a pilot parallel review program jointly conducted by the US FDA and the Centers for Medicare & Medicaid Services (CMS). This parallel review process shortened the overall review for Cologuard and resulted in a preliminary National Coverage Determination that coincided with FDA approval.

  8. Method, systems, and computer program products for implementing function-parallel network firewall

    DOEpatents

    Fulp, Errin W. [Winston-Salem, NC]; Farley, Ryan J. [Winston-Salem, NC]

    2011-10-11

    Methods, systems, and computer program products for providing function-parallel firewalls are disclosed. According to one aspect, a function-parallel firewall includes a first firewall node for filtering received packets using a first portion of a rule set including a plurality of rules. The first portion includes less than all of the rules in the rule set. At least one second firewall node filters packets using a second portion of the rule set. The second portion includes at least one rule in the rule set that is not present in the first portion. The first and second portions together include all of the rules in the rule set.
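
    The function-parallel split can be sketched in C (an illustration only, not the patented system; it ignores first-match rule ordering, which a real firewall must reconcile when rules are divided across nodes):

        #include <stdbool.h>
        #include <stdio.h>

        typedef struct { unsigned src, dst; bool deny; } Rule;

        /* Each "firewall node" filters with only its portion of the rule set. */
        static bool portion_denies(const Rule *rules, int n,
                                   unsigned src, unsigned dst) {
            for (int i = 0; i < n; i++)
                if (rules[i].deny && rules[i].src == src && rules[i].dst == dst)
                    return true;
            return false;
        }

        int main(void) {
            Rule all[4] = { {1, 2, true}, {3, 4, false}, {5, 6, true}, {7, 8, true} };
            /* Function-parallel split: node A holds rules 0-1, node B holds 2-3;
               together the portions cover the whole rule set. */
            bool deny = portion_denies(all,     2, 5, 6)   /* node A */
                     || portion_denies(all + 2, 2, 5, 6);  /* node B */
            printf("packet (5,6): %s\n", deny ? "dropped" : "accepted");
            return 0;
        }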

  9. Static Analysis of Run-Time Errors in Embedded Critical Parallel C Programs

    NASA Astrophysics Data System (ADS)

    Miné, Antoine

    We present a static analysis by Abstract Interpretation to check for run-time errors in parallel C programs. Following our work on Astrée, we focus on embedded critical programs without recursion nor dynamic memory allocation, but extend the analysis to a static set of threads. Our method iterates a slightly modified non-parallel analysis over each thread in turn, until thread interferences stabilize. We prove the soundness of the method with respect to a sequential consistent semantics and a reasonable weakly consistent memory semantics. We then show how to take into account mutual exclusion and thread priorities through partitioning over the scheduler state. We present preliminary experimental results analyzing a real program with our prototype, Thésée, and demonstrate the scalability of our approach.

  10. PINCA: A scalable parallel program for compressible gas dynamics with nonequilibrium chemistry

    NASA Astrophysics Data System (ADS)

    Wong, C. C.; Blottner, F. G.; Payne, J. L.; Soetrisno, M.; Imlay, S. T.

    1995-04-01

    This report documents an exploratory research work, funded by the Laboratory Directed Research and Development (LDRD) office at Sandia National Laboratories, to develop an advanced, general-purpose, robust compressible flow solver for handling large, complex, chemically reacting gas dynamics problems. The deliverable of this project, a computer program called PINCA (Parallel INtegrated Computer Analysis), will run on massively parallel computers such as the Intel/Gamma and Intel/Paragon. With the development of this parallel compressible flow solver, engineers will be better able to address large three-dimensional scientific and engineering problems involving multi-component gas mixtures with finite-rate chemistry. These problems occur in high-temperature industrial processes, combustion, and hypersonic reentry of spacecraft.

  11. Methodologies and Tools for Tuning Parallel Programs: 80% Art, 20% Science, and 10% Luck

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Bailey, David (Technical Monitor)

    1996-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. In the past few years, the ubiquitous introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance instrumentation/monitor/tuning technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g. AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.

  12. Work stealing for GPU-accelerated parallel programs in a global address space framework

    SciTech Connect

    Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram; Balaji, Pavan; Sadayappan, P.

    2016-09-10

    Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.
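
    The core data structure behind work stealing is a per-worker task deque: the owner pushes and pops at the bottom while idle thieves steal from the top. A lock-based C sketch follows (an illustration, not the system described above; production deques such as the Chase-Lev deque use atomics instead of a mutex, and this sketch assumes at most CAP tasks outstanding):

        #include <pthread.h>

        #define CAP 1024

        typedef struct {
            int tasks[CAP];
            int top, bottom;        /* monotonically increasing indices */
            pthread_mutex_t m;
        } Deque;

        void deque_init(Deque *d) {
            d->top = d->bottom = 0;
            pthread_mutex_init(&d->m, NULL);
        }

        void push(Deque *d, int task) {        /* owner end */
            pthread_mutex_lock(&d->m);
            d->tasks[d->bottom++ % CAP] = task;
            pthread_mutex_unlock(&d->m);
        }

        int pop(Deque *d) {                    /* owner end; -1 if empty */
            pthread_mutex_lock(&d->m);
            int task = (d->bottom > d->top) ? d->tasks[--d->bottom % CAP] : -1;
            pthread_mutex_unlock(&d->m);
            return task;
        }

        int steal(Deque *d) {                  /* thief end; -1 if empty */
            pthread_mutex_lock(&d->m);
            int task = (d->bottom > d->top) ? d->tasks[d->top++ % CAP] : -1;
            pthread_mutex_unlock(&d->m);
            return task;
        }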

  13. Enabling Requirements-Based Programming for Highly-Dependable Complex Parallel and Distributed Systems

    NASA Technical Reports Server (NTRS)

    Hinchey, Michael G.; Rash, James L.; Rouff, Christopher A.

    2005-01-01

    The manual application of formal methods in system specification has produced successes, but in the end, despite any claims and assertions by practitioners, there is no provable relationship between a manually derived system specification or formal model and the customer's original requirements. Complex parallel and distributed systems present the worst-case implications for today's dearth of viable approaches for achieving system dependability. No avenue other than formal methods constitutes a serious contender for resolving the problem, and so recognition of requirements-based programming has come at a critical juncture. We describe a new, NASA-developed automated requirements-based programming method that can be applied to certain classes of systems, including complex parallel and distributed systems, to achieve a high degree of dependability.

  14. A computer program based on parallel line assay for analysis of skin tests.

    PubMed

    Martín, S; Cuesta, P; Rico, P; Cortés, C

    1997-01-01

    A computer program for the analysis of differences or changes in skin sensitivity has been developed. It is based on parallel line assay, and its main features are its ability to conduct a validation process which ensures that the data from skin tests conform to the conditions imposed by the analysis which is carried out (regression, parallelism, etc.), the estimation of the difference or change in skin sensitivity, and the determination of the 95% and 99% confidence intervals of this estimation. This program is capable of managing data from independent groups, as well as paired data, and it may be applied to the comparison of allergen extracts, with the aim of determining their biologic activity, as well as to the analysis of changes in skin sensitivity appearing as a consequence of treatment such as immunotherapy.

  15. Class Notes: Programming Parallel Algorithms CS 15-840B (Fall 1992)

    DTIC Science & Technology

    1993-02-01

    [Scanned DTIC lecture notes; only OCR residue survives. Recoverable details: Lecture #15 (scribe: Bob Wheeler, Thursday, 6 Nov 92) covers connected components (continued) and minimum spanning trees; the residue also includes student project listings (singular value decomposition, EEG analysis, speech recognition, matrix operations) and bibliography fragments.]

  16. Dynamic programming in parallel boundary detection with application to ultrasound intima-media segmentation.

    PubMed

    Zhou, Yuan; Cheng, Xinyao; Xu, Xiangyang; Song, Enmin

    2013-12-01

    Segmentation of carotid artery intima-media in longitudinal ultrasound images for measuring its thickness to predict cardiovascular diseases can be simplified as detecting two nearly parallel boundaries within a certain distance range, when plaque with irregular shapes is not considered. In this paper, we improve the implementation of two dynamic programming (DP) based approaches to parallel boundary detection, dual dynamic programming (DDP) and piecewise linear dual dynamic programming (PL-DDP). Then, a novel DP based approach, dual line detection (DLD), which translates the original 2-D curve position to a 4-D parameter space representing two line segments in a local image segment, is proposed to solve the problem while maintaining efficiency and rotation invariance. To apply the DLD to ultrasound intima-media segmentation, it is embedded in a framework that employs an edge map obtained from multiplication of the responses of two edge detectors with different scales and a coupled snake model that simultaneously deforms the two contours for maintaining parallelism. The experimental results on synthetic images and carotid arteries of clinical ultrasound images indicate improved performance of the proposed DLD compared to DDP and PL-DDP, with respect to accuracy and efficiency. Copyright © 2013 Elsevier B.V. All rights reserved.

  17. Support of Multidimensional Parallelism in the OpenMP Programming Model

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele

    2003-01-01

    OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often affect the scalability of applications. Examples of these limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test with benchmark codes and a cloud modeling code.
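
    The four proposed constructs are not reproduced in this record; as a baseline for comparison, standard OpenMP already distributes a multidimensional loop nest over the thread team with the collapse clause:

        #include <stdio.h>

        #define NX 512
        #define NY 512
        static double grid[NX][NY];

        int main(void) {
            /* collapse(2) merges both loop dimensions into a single
               iteration space that is divided among the threads. */
        #pragma omp parallel for collapse(2)
            for (int i = 0; i < NX; i++)
                for (int j = 0; j < NY; j++)
                    grid[i][j] = (double)(i + j);

            printf("grid[1][1] = %f\n", grid[1][1]);
            return 0;
        }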

  18. Development, Verification and Validation of Parallel, Scalable Volume of Fluid CFD Program for Propulsion Applications

    NASA Technical Reports Server (NTRS)

    West, Jeff; Yang, H. Q.

    2014-01-01

    There are many instances involving liquid/gas interfaces and their dynamics in the design of liquid-engine-powered rockets such as the Space Launch System (SLS). Some examples of these applications are: propellant tank draining and slosh, subcritical-condition injector analysis for gas generators, preburners, and thrust chambers, water deluge mitigation for launch-induced environments, and even solid rocket motor liquid slag dynamics. Commercially available CFD programs simulating gas/liquid interfaces using the Volume of Fluid approach are currently limited in their parallel scalability. In 2010, for instance, an internal NASA/MSFC review of three commercial tools revealed that parallel scalability was seriously compromised at 8 CPUs and no additional speedup was possible after 32 CPUs. Other non-interface CFD applications at the time were demonstrating useful parallel scalability up to 4,096 processors or more. Based on this review, NASA/MSFC initiated an effort to implement a Volume of Fluid capability within the unstructured-mesh, pressure-based CFD program Loci-STREAM. After verification was achieved by comparing results to the commercial CFD program CFD-Ace+, and validation by direct comparison with data, Loci-STREAM-VoF is now the production CFD tool for propellant slosh force and slosh damping rate simulations at NASA/MSFC. On these applications, good parallel scalability has been demonstrated for problem sizes of tens of millions of cells and thousands of CPU cores. Ongoing efforts are focused on the application of Loci-STREAM-VoF to predict the transient flow patterns of water on the SLS Mobile Launch Platform in order to support the phasing of water for launch environment mitigation so that detrimental effects on the vehicle are not realized.

  19. A parallel domain decomposition algorithm for coastal ocean circulation models based on integer linear programming

    NASA Astrophysics Data System (ADS)

    Jordi, Antoni; Georgas, Nickitas; Blumberg, Alan

    2017-05-01

    This paper presents a new parallel domain decomposition algorithm based on integer linear programming (ILP), a mathematical optimization method. To minimize the computation time of coastal ocean circulation models, the ILP decomposition algorithm divides the global domain into local domains with balanced workloads according to the number of processors and avoids computations over as many land grid cells as possible. In addition, it maintains the use of logically rectangular local domains and achieves exactly the same results as traditional domain decomposition algorithms (such as Cartesian decomposition). However, the ILP decomposition algorithm may not converge to an exact solution for relatively large domains. To overcome this problem, we developed two ILP decomposition formulations. The first one (complete formulation) has no additional restriction, although it is impractical for large global domains. The second one (feasible) imposes local domains with the same dimensions and looks for the feasibility of such a decomposition, which allows much larger global domains. The parallel performance of both ILP formulations is compared to a base Cartesian decomposition by simulating two cases with the newly created parallel version of the Stevens Institute of Technology's Estuarine and Coastal Ocean Model (sECOM). Simulations with the ILP formulations always run faster than the ones with the base decomposition, and the complete formulation is better than the feasible one when it is applicable. In addition, parallel efficiency with the ILP decomposition may be greater than one.
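
    A generic load-balancing ILP of the kind described (a schematic, not the paper's exact formulation) assigns candidate local domains c, each with active-cell work w_c, to processors d while minimizing the maximum load L:

        \min\ L \quad \text{subject to} \quad
        \sum_{d} x_{cd} = 1 \ \ \forall c, \qquad
        \sum_{c} w_c \, x_{cd} \le L \ \ \forall d, \qquad
        x_{cd} \in \{0, 1\}.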

  20. The FORCE: A portable parallel programming language supporting computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Brehm, Juergen; Ramanan, Aruna

    1989-01-01

    This project supports the conversion of codes in Computational Structural Mechanics (CSM) to a parallel form which will efficiently exploit the computational power available from multiprocessors. The work is part of a comprehensive, FORTRAN-based system to form a basis for a parallel version of the NICE/SPAR combination which will form the CSM Testbed. The software is macro-based and rests on the Force methodology developed by the principal investigator in connection with an early scientific multiprocessor. Machine independence is an important characteristic of the system, so that retargeting it to the Flex/32, or any other multiprocessor on which NICE/SPAR might be implemented, is well supported. The principal investigator has experience in producing parallel software for both full and sparse systems of linear equations using the Force macros. Other researchers have used the Force in finite element programs. It has been possible to rapidly develop software which performs at maximum efficiency on a multiprocessor. The inherent machine independence of the system also means that the parallelization will not be limited to a specific multiprocessor.

  1. Concurrent Collections (CnC): A new approach to parallel programming

    ScienceCinema

    None

    2016-07-12

    A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and the need to provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer or program) who demands maximal flexibility for optimizing the mapping to a wide range of target platforms (parallel / serial, shared / distributed, homogeneous / heterogeneous, etc.) but needs no knowledge of the domain. Concurrent Collections (CnC) is based on this separation of concerns. The talk will present CnC and its benefits. About the speaker: Kathleen Knobe has focused throughout her career on parallelism, especially compiler technology, runtime system design, and language design. She worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms for Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipment’s Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired and absorbed by Intel. She currently works in the Software and

  2. Exploiting parallel computing with limited program changes using a network of microcomputers

    NASA Technical Reports Server (NTRS)

    Rogers, J. L., Jr.; Sobieszczanski-Sobieski, J.

    1985-01-01

    Network computing and multiprocessor computers are two discernible trends in parallel processing. The computational behavior of an iterative distributed process in which some subtasks are completed later than others because of an imbalance in computational requirements is of significant interest. The effects of asynchronous processing were studied. A small existing program was converted to perform finite element analysis by distributing substructure analysis over a network of four Apple IIe microcomputers connected to a shared disk, simulating a parallel computer. The substructure analysis uses an iterative, fully stressed, structural resizing procedure. A framework of beams divided into three substructures is used as the finite element model. The effects of asynchronous processing on the convergence of the design variables are determined by not resizing particular substructures on various iterations.

  3. Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts

    SciTech Connect

    1997-12-31

    This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen lively and fruitful progress over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.

  4. Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

    NASA Technical Reports Server (NTRS)

    Lawson, Gary; Poteat, Michael; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

    2016-01-01

    In this work, several mini-apps have been created to enhance the performance of a real-world application, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms that combine the Message Passing Interface (MPI) for distributed memory access with either Shared MPI (SMPI) or OpenMP for shared memory access. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23X was measured for MPI+SMPI, but only 10X was measured for MPI+OpenMP.
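
    A minimal mpi4py sketch of the Shared MPI idea, assuming MPI-3 shared-memory windows: ranks on the same node map a single array instead of exchanging messages. The array size and the per-rank work are placeholders, not VULCAN's actual kernels.

      from mpi4py import MPI
      import numpy as np

      world = MPI.COMM_WORLD
      node = world.Split_type(MPI.COMM_TYPE_SHARED)   # ranks sharing this node

      n = 1_000_000
      itemsize = MPI.DOUBLE.Get_size()
      # rank 0 on each node allocates the window; the others attach with size 0
      win = MPI.Win.Allocate_shared(n * itemsize if node.rank == 0 else 0,
                                    itemsize, comm=node)
      buf, _ = win.Shared_query(0)                    # everyone maps rank 0's memory
      data = np.ndarray(buffer=buf, dtype='d', shape=(n,))

      win.Fence()
      lo = node.rank * n // node.size                 # fill this rank's slice
      hi = (node.rank + 1) * n // node.size
      data[lo:hi] = node.rank
      win.Fence()                                     # writes now visible node-wide

      if node.rank == 0:
          print("node leader sees sum", data.sum())
      win.Free()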

  5. Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.

    PubMed

    Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano

    2014-09-09

    A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system.
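
    For reference, the block-cyclic scheme used for the linear algebra phase distributes fixed-size blocks round-robin over processes; here is a tiny Python sketch of the 1-D index mapping (the 2-D case applies it to rows and columns independently; names are illustrative).

      def block_cyclic(i, nb, p):
          """Owner process and local index of global index i under a 1-D
          block-cyclic distribution with block size nb over p processes."""
          block, offset = divmod(i, nb)
          return block % p, (block // p) * nb + offset

      # nb=2, p=3: indices 0,1 -> proc 0; 2,3 -> proc 1; 4,5 -> proc 2; 6,7 -> proc 0 ...
      print([block_cyclic(i, 2, 3) for i in range(8)])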

  6. Experiences with different parallel programming paradigms for Monte Carlo particle transport leads to a portable toolkit for parallel Monte Carlo

    SciTech Connect

    Martin, W.R.; Majumdar, A.; Rathkopf, J.A.; Litvin, M.

    1993-04-01

    Monte Carlo particle transport is easy to implement on massively parallel computers relative to other methods of transport simulation. This paper describes experiences of implementing a realistic demonstration Monte Carlo code on a variety of parallel architectures. Our "pool of tasks" technique, which allows reproducibility from run to run regardless of the number of processors, is discussed. We present detailed timing studies of simulations performed on the 128 processor BBN-ACI TC2000 and preliminary timing results for the 32 processor Kendall Square Research KSR-1. Given sufficient workload to distribute across many computational nodes, the BBN achieves nearly linear speedup for a large number of nodes. The KSR, with which we have had less experience, performs poorly with more than ten processors. A simple model incorporating known causes of overhead accurately predicts observed behavior. A general-purpose communication and control package to facilitate the implementation of existing Monte Carlo packages is described together with timings on the BBN. This package adds insignificantly to the computational costs of parallel simulations.

  7. Experiences with different parallel programming paradigms for Monte Carlo particle transport leads to a portable toolkit for parallel Monte Carlo

    SciTech Connect

    Martin, W.R.; Majumdar, A.; Rathkopf, J.A.; Litvin, M.

    1993-04-01

    Monte Carlo particle transport is easy to implement on massively parallel computers relative to other methods of transport simulation. This paper describes experiences of implementing a realistic demonstration Monte Carlo code on a variety of parallel architectures. Our "pool of tasks" technique, which allows reproducibility from run to run regardless of the number of processors, is discussed. We present detailed timing studies of simulations performed on the 128 processor BBN-ACI TC2000 and preliminary timing results for the 32 processor Kendall Square Research KSR-1. Given sufficient workload to distribute across many computational nodes, the BBN achieves nearly linear speedup for a large number of nodes. The KSR, with which we have had less experience, performs poorly with more than ten processors. A simple model incorporating known causes of overhead accurately predicts observed behavior. A general-purpose communication and control package to facilitate the implementation of existing Monte Carlo packages is described together with timings on the BBN. This package adds insignificantly to the computational costs of parallel simulations.

  8. Molecular views of damaged DNA: Adaptation of the Program DUPLEX to parallel architectures

    SciTech Connect

    Hingerty, B.E.; Crawford, O.H.; Broyde, S.; Wagner, R.A.

    1994-09-01

    The nucleic acids molecular mechanics program DUPLEX has been designed with useful features for surveying the potential energy surface of polynucleotides, especially ones that are modified by polycyclic aromatic carcinogens. The program features helpful strategies for addressing the multiple minimum problem: (1) the reduced variable domain of torsion angle space; (2) search strategies that emphasize large scale searches for smaller subunits, followed by building to larger units by a variety of strategies; (3) the use of penalty functions to aid the minimizer in locating selected structural types in first stage minimizations; penalty functions are released in terminal minimizations to yield final unrestrained minimum energy conformations. Predictive capability is illustrated by DNA modified by activated benzo[a]pyrenes. The first stage of adaptation to parallel computers is described.
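
    A toy sketch of the two-stage penalty strategy in item (3), with a stand-in quadratic-plus-cosine "energy" in place of DUPLEX's torsion-space force field and SciPy's general-purpose minimizer in place of DUPLEX's own; all functions and numbers are illustrative.

      import numpy as np
      from scipy.optimize import minimize

      def energy(x):                      # stand-in for the force field in torsion space
          return np.sum((x - 1.0) ** 2) + 0.1 * np.sum(np.cos(3 * x))

      def penalized(x, target, w=10.0):   # stage 1: bias toward a selected structure type
          return energy(x) + w * np.sum((x - target) ** 2)

      x0 = np.zeros(6)                    # six torsion-like variables
      target = np.full(6, 0.8)
      stage1 = minimize(penalized, x0, args=(target,))   # restrained minimization
      stage2 = minimize(energy, stage1.x)                # penalty released
      print(stage2.fun, stage2.x)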

  9. Parallelization of the NAS Conjugate Gradient Benchmark Using the Global Arrays Shared Memory Programming Model

    SciTech Connect

    Zhang, Yeliang; Tipparaju, Vinod; Nieplocha, Jarek; Hariri, Salim

    2005-04-08

    The NAS Conjugate Gradient (CG) benchmark is an important scientific kernel used to evaluate machine performance and compare characteristics of different programming models. The Global Arrays (GA) toolkit supports a shared memory programming paradigm, even on distributed memory systems, and offers the programmer control over the distribution and locality that are important for optimizing performance on scalable architectures. In this paper, we describe and compare two different parallelization strategies of the CG benchmark using GA and report performance results on a shared-memory system as well as on a cluster. Performance benefits of using shared memory for irregular/sparse computations have been demonstrated before in the context of the CG benchmark using OpenMP. Similarly, the GA implementation outperforms the standard MPI implementation on a shared-memory system, in our case the SGI Altix. However, with GA these benefits are extended to distributed memory systems and demonstrated on a Linux cluster with Myrinet.

  10. Efficient iteration in data-parallel programs with irregular and dynamically distributed data structures

    SciTech Connect

    Littlefield, R.J.

    1990-02-01

    To implement an efficient data-parallel program on a non-shared memory MIMD multicomputer, data and computations must be properly partitioned to achieve good load balance and locality of reference. Programs with irregular data reference patterns often require irregular partitions. Although good partitions may be easy to determine, they can be difficult or impossible to implement in programming languages that provide only regular data distributions, such as blocked or cyclic arrays. We are developing Onyx, a programming system that provides a shared memory model of distributed data structures and extends the concept of data distribution to include irregular and dynamic distributions. This provides a powerful means to specify irregular partitions. Perhaps surprisingly, programs using it can also execute efficiently. In this paper, we describe and evaluate the Onyx implementation of a model problem that repeatedly executes an irregular but fixed data reference pattern. On an NCUBE hypercube, the speed of the Onyx implementation is comparable to that of carefully handwritten message-passing code.

  11. What is "the patient perspective" in patient engagement programs? Implicit logics and parallels to feminist theories.

    PubMed

    Rowland, Paula; McMillan, Sarah; McGillicuddy, Patti; Richards, Joy

    2017-01-01

    Public and patient involvement (PPI) in health care may refer to many different processes, ranging from participating in decision-making about one's own care to participating in health services research, health policy development, or organizational reforms. Across these many forms of public and patient involvement, the conceptual and theoretical underpinnings remain poorly articulated. Instead, most public and patient involvement programs rely on policy initiatives as their conceptual frameworks. This lack of conceptual clarity contributes to dilemmas in program design, implementation, and evaluation. This study contributes to the development of theoretical understandings of public and patient involvement. In particular, we focus on the deployment of patient engagement programs within health service organizations. To develop a deeper understanding of the conceptual underpinnings of these programs, we examined the concept of "the patient perspective" as used by patient engagement practitioners and participants. Specifically, we focused on the way this phrase was used in the singular: "the" patient perspective or "the" patient voice. From qualitative analysis of interviews with 20 patient advisers and 6 staff members within a large urban health network in Canada, we argue that "the patient perspective" is referred to as a particular kind of situated knowledge, specifically an embodied knowledge of vulnerability. We draw parallels between this logic of patient perspective and the logic of early feminist theory, including the concepts of standpoint theory and strong objectivity. We suggest that champions of patient engagement may learn much from the way feminist theorists have constructed their arguments and addressed critique.

  12. Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets.

    PubMed

    Shrimankar, D D; Sathe, S R

    2016-01-01

    Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputers often consist of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, OpenMP programs cannot scale beyond a single SMP node, whereas MPI programs can span multiple SMP nodes at the cost of internode communication overhead. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that communication overhead is significant even in OpenMP loop execution and increases with the number of participating cores. We also present a communication model to approximate the overhead from communication in OpenMP loops. Our results hold for a large variety of input data files. We have developed our own load balancing and cache optimization techniques for the message-passing model. Our experimental results show that these techniques give optimum performance of our parallel algorithm for various input parameters, such as sequence size and tile size, on a wide variety of multicore architectures.
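
    A minimal serial sketch of the tile idea, assuming a Needleman-Wunsch score matrix: all tiles on one anti-diagonal depend only on earlier anti-diagonals and are mutually independent, which is the parallelism OpenMP threads or MPI ranks can exploit. Scoring parameters and tile size are arbitrary.

      import numpy as np

      def nw_score(a, b, match=1, mismatch=-1, gap=-2, tile=64):
          H = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
          H[0, :] = gap * np.arange(len(b) + 1)
          H[:, 0] = gap * np.arange(len(a) + 1)
          n_i = (len(a) + tile - 1) // tile
          n_j = (len(b) + tile - 1) // tile
          for d in range(n_i + n_j - 1):                 # sweep tile anti-diagonals
              for ti in range(max(0, d - n_j + 1), min(n_i, d + 1)):
                  tj = d - ti                            # tiles (ti, tj) on this diagonal
                  for i in range(ti * tile + 1, min((ti + 1) * tile, len(a)) + 1):
                      for j in range(tj * tile + 1, min((tj + 1) * tile, len(b)) + 1):
                          s = match if a[i - 1] == b[j - 1] else mismatch
                          H[i, j] = max(H[i - 1, j - 1] + s,
                                        H[i - 1, j] + gap, H[i, j - 1] + gap)
          return H[-1, -1]

      print(nw_score("GATTACA", "GCATGCU"))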

  13. Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets

    PubMed Central

    Shrimankar, D. D.; Sathe, S. R.

    2016-01-01

    Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today’s supercomputers often consist of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, OpenMP programs cannot scale beyond a single SMP node, whereas MPI programs can span multiple SMP nodes at the cost of internode communication overhead. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that communication overhead is significant even in OpenMP loop execution and increases with the number of participating cores. We also present a communication model to approximate the overhead from communication in OpenMP loops. Our results hold for a large variety of input data files. We have developed our own load balancing and cache optimization techniques for the message-passing model. Our experimental results show that these techniques give optimum performance of our parallel algorithm for various input parameters, such as sequence size and tile size, on a wide variety of multicore architectures. PMID:27932868

  14. Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs

    DTIC Science & Technology

    1991-12-07

    We survey some of the principal paradigms for fault-tolerant distributed computing and discuss their relevance to parallel processing. Keywords: parallel processing, reliability, transactions, checkpointing, recovery, replication, reliable broadcast.

  15. DOE SBIR Phase-1 Report on Hybrid CPU-GPU Parallel Development of the Eulerian-Lagrangian Barracuda Multiphase Program

    SciTech Connect

    Dr. Dale M. Snider

    2011-02-28

    This report gives the results from the Phase-1 work on demonstrating greater than 10x speedup of the Barracuda computer program using parallel methods and GPU processors (General-Purpose Graphics Processing Units). Phase-1 demonstrated a 12x speedup on a typical Barracuda function using the GPU processor. The test case used about 5 million particles and 250,000 Eulerian grid cells. The relative speedup, compared to a single CPU, increases with the number of particles, giving greater than 12x speedup. Phase-1 work provided a path for data structure modifications that give good parallel performance while keeping a friendly environment for new physics development and code maintenance. The implementation of data structure changes will be in Phase-2. Phase-1 laid the groundwork for the complete parallelization of Barracuda in Phase-2, with the caveat that the computing practices for parallel programming implemented in Phase-1 give immediate speedup in the current serial Barracuda code. The Phase-1 tasks were completed successfully, laying the framework for Phase-2. The detailed results of Phase-1 are within this document. In general, the speedup of one function would be expected to be higher than the speedup of the entire code because of I/O functions and communication between the algorithms. However, because one of the most difficult Barracuda algorithms was parallelized in Phase-1, and because advanced parallelization methods and proposed parallelization optimization techniques identified in Phase-1 will be used in Phase-2, an overall Barracuda code speedup (relative to a single CPU) is expected to be greater than 10x. This means that a job which takes 30 days to complete will be done in 3 days. Tasks completed in Phase-1 are: Task 1: Profile the entire Barracuda code and select which subroutines are to be parallelized (See Section Choosing a Function to Accelerate) Task 2: Select a GPU consultant company and

  16. Mobile and replicated alignment of arrays in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

    1993-01-01

    When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

  17. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    DOE PAGES

    O'keefe, Matthew; Parr, Terence; Edgar, B. Kevin; ...

    1995-01-01

    Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.

  18. The effect of preventive educational program in cigarette smoking: Extended Parallel Process Model

    PubMed Central

    Gharlipour, Zabihollah; Hazavehei, Seyed Mohammad Mehdi; Moeini, Babak; Nazari, Mahin; Beigi, Abbas Moghim; Tavassoli, Elahe; Heydarabadi, Akbar Babaei; Reisi, Mahnoush; Barkati, Hasan

    2015-01-01

    Background: Cigarette smoking is one of the preventable causes of disease and death. The most important preventive measure is teaching techniques to resist peer pressure. Any educational program should be designed with an emphasis on theories of behavioral change and based on effective educational methods. To investigate educational interventions for the prevention of cigarette smoking, this paper used the Extended Parallel Process Model (EPPM). Materials and Methods: This is a quasi-experimental study. Two middle schools were randomly selected from among male students in Shiraz. We then randomly selected 120 students for the experimental group and 120 students for the control group. After diagnostic evaluation, educational interventions on the consequences of smoking and on preventive skills were applied. Results: Our results indicated that there was a significant difference between students in the control and experimental groups in the means of perceived susceptibility (P < 0.000, t = 6.84), perceived severity (P < 0.000, t = −11.46), perceived response efficacy (P < 0.000, t = −7.07), perceived self-efficacy (P < 0.000, t = −11.64), and preventive behavior (P < 0.000, t = −24.36). Conclusions: The EPPM, along with education in the skills necessary to resist peer pressure, was significantly effective in improving preventive behavior regarding cigarette smoking among adolescents. However, this study recommends further research on ways of increasing perceived susceptibility to cigarette smoking among adolescents. PMID:25767815

  19. The effect of preventive educational program in cigarette smoking: Extended Parallel Process Model.

    PubMed

    Gharlipour, Zabihollah; Hazavehei, Seyed Mohammad Mehdi; Moeini, Babak; Nazari, Mahin; Beigi, Abbas Moghim; Tavassoli, Elahe; Heydarabadi, Akbar Babaei; Reisi, Mahnoush; Barkati, Hasan

    2015-01-01

    Cigarette smoking is one of the preventable causes of disease and death. The most important preventive measure is teaching techniques to resist peer pressure. Any educational program should be designed with an emphasis on theories of behavioral change and based on effective educational methods. To investigate educational interventions for the prevention of cigarette smoking, this paper used the Extended Parallel Process Model (EPPM). This is a quasi-experimental study. Two middle schools were randomly selected from among male students in Shiraz. We then randomly selected 120 students for the experimental group and 120 students for the control group. After diagnostic evaluation, educational interventions on the consequences of smoking and on preventive skills were applied. Our results indicated that there was a significant difference between students in the control and experimental groups in the means of perceived susceptibility (P < 0.000, t = 6.84), perceived severity (P < 0.000, t = -11.46), perceived response efficacy (P < 0.000, t = -7.07), perceived self-efficacy (P < 0.000, t = -11.64), and preventive behavior (P < 0.000, t = -24.36). The EPPM, along with education in the skills necessary to resist peer pressure, was significantly effective in improving preventive behavior regarding cigarette smoking among adolescents. However, this study recommends further research on ways of increasing perceived susceptibility to cigarette smoking among adolescents.

  20. Parallel conjugate gradient: effects of ordering strategies, programming paradigms, and architectural platforms

    SciTech Connect

    Oliker, L.; Li, X.; Heber, G.; Biswas, R.

    2000-05-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique for solving sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message-passing performance using shared memory constructs through careful data ordering and distribution. However, a multithreaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.

  1. Event-Based Study of the Effect of Execution Environments on Parallel Program Performance

    NASA Technical Reports Server (NTRS)

    Sarukkai, Sekhar R.; Yan, Jerry C.; Craw, James (Technical Monitor)

    1995-01-01

    In this paper we seek to demonstrate the importance of studying the effect of changes in execution environment parameters on parallel applications executed on state-of-the-art multiprocessors. A comprehensive methodology for event-based analysis of program behavior is introduced. This methodology is used to study the performance significance of various system parameters such as processor speed, message-buffer size, buffer copy speed, network bandwidth, communication latency, interrupt overheads, and other system parameters. With the help of a few CFD examples, we illustrate the use of our technique in determining suitable parameter values of the execution environment for three applications. We also demonstrate how this approach can be used to predict performance across architectures and illustrate the use of visual and profile-like feedback to expose the effect of system parameter changes on the performance of specific application modules.

  2. Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Heber, Gerd; Biswas, Rupak

    2000-01-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
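
    Both records center on the SPMV kernel inside CG; a compact NumPy/SciPy sketch of the serial algorithm makes that kernel explicit (the parallel versions distribute exactly the A @ p product; the test matrix is an arbitrary SPD example).

      import numpy as np
      import scipy.sparse as sp

      def cg(A, b, tol=1e-8, maxit=1000):
          x = np.zeros_like(b)
          r = b - A @ x
          p = r.copy()
          rs = r @ r
          for _ in range(maxit):
              Ap = A @ p                   # the SPMV that dominates the flop count
              alpha = rs / (p @ Ap)
              x += alpha * p
              r -= alpha * Ap
              rs_new = r @ r
              if np.sqrt(rs_new) < tol:
                  break
              p = r + (rs_new / rs) * p
              rs = rs_new
          return x

      n = 1000                             # SPD test matrix: 1-D Laplacian
      A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
      b = np.ones(n)
      print(np.linalg.norm(A @ cg(A, b) - b))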

  3. A pattern recognition system for prostate mass spectra discrimination based on the CUDA parallel programming model

    NASA Astrophysics Data System (ADS)

    Kostopoulos, Spiros; Glotsos, Dimitris; Sidiropoulos, Konstantinos; Asvestas, Pantelis; Cavouras, Dionisis; Kalatzis, Ioannis

    2014-03-01

    The aim of the present study was to implement a pattern recognition system for the discrimination of healthy from malignant prostate tumors from proteomic Mass Spectroscopy (MS) samples and to identify m/z intervals of potential biomarkers associated with prostate cancer. One hundred and six MS-spectra were studied in total. Sixty-three spectra corresponded to healthy cases (PSA < 1) and forty-three spectra were cancerous (PSA > 10). The MS-spectra are publicly available from the NCI Clinical Proteomics Database. The pre-processing comprised the steps of denoising, normalization, peak extraction, and peak alignment. Due to the enormous number of features that arose from the MS-spectra as informative peaks, and in order to secure optimum system design, the classification task was performed by programming the multiprocessors of an nVIDIA GPU card in parallel, using the CUDA framework. The proposed system achieved 98.1% accuracy. The identified m/z intervals displayed significant statistical differences between the two classes and were found to possess adequate discriminatory power in characterizing prostate samples when employed in the design of the classification system. Those intervals should be further investigated, since they might lead to the identification of potential new biomarkers for prostate cancer.

  4. 3-D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS)

    SciTech Connect

    Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.

    1997-12-31

    The aim of the work performed is to develop a 3D parallel program for the numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS), satisfying the condition that the numerical results be independent of the number of processors involved. Two basically different approaches to the structure of massively parallel computations have been developed. The first approach uses a 3D data matrix decomposition that is reconstructed at each temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shared memory. The second approach is based on a 3D data matrix decomposition that is not reconstructed during a temporal cycle. The program was developed on the 8-processor CS MP-3 made at VNIIEF and was adapted to the massively parallel CS Meiko-2 at LLNL by the joint efforts of the VNIIEF and LLNL staffs. A large number of numerical experiments have been carried out with different numbers of processors (up to 256), and the parallelization efficiency has been evaluated as a function of the number of processors and their parameters.

  5. ParallelStructure: A R Package to Distribute Parallel Runs of the Population Genetics Program STRUCTURE on Multi-Core Computers

    PubMed Central

    Besnier, Francois; Glover, Kevin A.

    2013-01-01

    This software package provides an R-based framework for making use of multi-core computers when running analyses in the population genetics program STRUCTURE. It is especially addressed to those users of STRUCTURE dealing with numerous and repeated data analyses who could take advantage of an efficient script to automatically distribute STRUCTURE jobs among multiple processors. It also provides additional functions to divide analyses among combinations of populations within a single data set without the need to manually produce multiple projects, as is currently the case in STRUCTURE. The package consists of two main functions, MPI_structure() and parallel_structure(), as well as an example data file. We compared the computing-time performance for these example data on two computer architectures and showed that the use of the present functions can result in several-fold improvements in computation time. ParallelStructure is freely available at https://r-forge.r-project.org/projects/parallstructure/. PMID:23923012

  6. ParallelStructure: a R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers.

    PubMed

    Besnier, Francois; Glover, Kevin A

    2013-01-01

    This software package provides an R-based framework for making use of multi-core computers when running analyses in the population genetics program STRUCTURE. It is especially addressed to those users of STRUCTURE dealing with numerous and repeated data analyses who could take advantage of an efficient script to automatically distribute STRUCTURE jobs among multiple processors. It also provides additional functions to divide analyses among combinations of populations within a single data set without the need to manually produce multiple projects, as is currently the case in STRUCTURE. The package consists of two main functions, MPI_structure() and parallel_structure(), as well as an example data file. We compared the computing-time performance for these example data on two computer architectures and showed that the use of the present functions can result in several-fold improvements in computation time. ParallelStructure is freely available at https://r-forge.r-project.org/projects/parallstructure/.

  7. SPSS and SAS programs for determining the number of components using parallel analysis and velicer's MAP test.

    PubMed

    O'Connor, B P

    2000-08-01

    Popular statistical software packages do not have the proper procedures for determining the number of components in factor and principal components analyses. Parallel analysis and Velicer's minimum average partial (MAP) test are validated procedures, recommended widely by statisticians. However, many researchers continue to use alternative, simpler, but flawed procedures, such as the eigenvalues-greater-than-one rule. Use of the proper procedures might be increased if these procedures could be conducted within familiar software environments. This paper describes brief and efficient programs for using SPSS and SAS to conduct parallel analyses and the MAP test.
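
    The O'Connor programs themselves are SPSS/SAS syntax; purely as an illustration of the parallel-analysis logic they implement, here is a minimal NumPy sketch: retain the leading components whose observed eigenvalues exceed a high percentile of eigenvalues obtained from random data of the same dimensions.

      import numpy as np

      def parallel_analysis(data, n_sims=1000, percentile=95, seed=0):
          rng = np.random.default_rng(seed)
          n, k = data.shape
          obs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
          rand = np.empty((n_sims, k))
          for s in range(n_sims):
              r = rng.standard_normal((n, k))
              rand[s] = np.linalg.eigvalsh(np.corrcoef(r, rowvar=False))[::-1]
          thresh = np.percentile(rand, percentile, axis=0)
          keep = 0                               # count leading components above threshold
          while keep < k and obs[keep] > thresh[keep]:
              keep += 1
          return keep

      data = np.random.default_rng(1).standard_normal((300, 10))
      print("components to retain:", parallel_analysis(data))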

  8. A component analysis based on serial results analyzing performance of parallel iterative programs

    SciTech Connect

    Richman, S.C.

    1994-12-31

    This research is concerned with the parallel performance of iterative methods for solving large, sparse, nonsymmetric linear systems. Most of the iterative methods are first presented with their time costs and convergence rates examined intensively on sequential machines, and then adapted to parallel machines. The analysis of parallel iterative performance is more complicated than that of serial performance, since the former can be affected by many new factors, such as data communication schemes, the number of processors used, and ordering and mapping techniques. Although the author is able to summarize results from experiments on certain cases, two questions remain: (1) How can the results obtained be explained? (2) How can the results be extended from these certain cases to general cases? To answer these two questions quantitatively, the author introduces a tool called component analysis based on serial results. This component analysis is introduced because the iterative methods consist mainly of several basic functions, such as linked triads, inner products, and triangular solves, which have different intrinsic parallelisms and are suitable for different parallel techniques. The parallel performance of each iterative method is first expressed as a weighted sum of the parallel performance of the basic functions that are the components of the method. Then, one separately examines the performance of the basic functions and the weighting distributions of the iterative methods, from which two independent sets of information are obtained when solving a given problem. In this component approach, all the weightings require only serial costs, not parallel costs, and each iterative method for solving a given problem is represented by its unique weighting distribution. The information given by the basic functions is independent of the iterative method, while that given by the weightings is independent of parallel technique, parallel machine, and number of processors.
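
    A hedged sketch of the component idea: the predicted parallel time per iteration is a weighted sum of per-function parallel timings, with the weights (calls per iteration) taken from the serial analysis. All numbers below are made up for illustration.

      basic_time = {                 # measured seconds per call on, say, 16 processors
          "linked_triad": 0.8e-3,
          "inner_product": 1.5e-3,   # reduction-heavy, typically scales worst
          "triangular_solve": 4.0e-3,
      }
      weights = {                    # calls per iteration, from the *serial* analysis
          "linked_triad": 5,
          "inner_product": 2,
          "triangular_solve": 1,
      }
      per_iter = sum(weights[f] * basic_time[f] for f in weights)
      print(f"predicted time per iteration: {per_iter * 1e3:.2f} ms")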

  9. Lazy task creation: A technique for increasing the granularity of parallel programs. Technical report

    SciTech Connect

    Mohr, E.; Kranz, D.; Halstead, R.H.

    1991-06-01

    Many parallel algorithms are naturally expressed at a fine level of granularity, often finer than a MIMD parallel system can exploit efficiently. Most builders of parallel systems have looked to either the programmer or a parallelizing compiler to increase the granularity of such algorithms. In this paper, the authors explore a third approach to the granularity problem by analyzing two strategies for combining parallel tasks dynamically at run-time. They reject the simpler load-based inlining method, where tasks are combined based on dynamic load level, in favor of the safer and more robust lazy task creation method, where tasks are created only retroactively as processing resources become available. These strategies grew out of work on Mul-T, an efficient parallel implementation of Scheme, but could be used with other languages as well. They describe our Mul-T implementations of lazy task creation for two contrasting machines, and present performance statistics which show the method's effectiveness. Lazy task creation allows efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems.

  10. Lazy task creation; A technique for increasing the granularity of parallel programs

    SciTech Connect

    Mohr, E. ); Kranz, D.A. ); Halstead, R.H. Jr. )

    1991-07-01

    Many parallel algorithms are naturally expressed at a fine level of granularity, often finer than a MIMD parallel system can exploit efficiently. Most builders of parallel systems have looked to either the programmer or a parallelizing compiler to increase the granularity of such algorithms. In this paper the authors explore a third approach to the granularity problem by analyzing two strategies for combining parallel tasks dynamically at runtime. The authors reject the simpler load-based inlining method, where tasks are combined based on dynamic load level, in favor of the safer and more robust lazy task creation method, where tasks are created only retroactively as processing resources become available. These strategies grew out of work on Mul-T, an efficient parallel implementation of Scheme, but could be used with other languages as well. The authors describe our Mul-T implementations of lazy task creation for two contrasting machines, and present performance statistics which show the method's effectiveness. Lazy task creation allows efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems.
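
    A toy, purely sequential Python rendering of the bookkeeping (hypothetical names; a real runtime such as Mul-T's lets idle processors steal rather than polling a queue): child calls are exposed as cheap thunks, the oldest exposed thunk may retroactively become a real task, and unstolen thunks are simply inlined.

      from collections import deque
      from itertools import count

      pending, stolen = {}, {}     # key -> unstolen thunk / key -> stolen result
      queue = deque()              # exposure order; oldest (coarsest) on the left
      fresh = count()
      tasks = inlined = 0

      def fib(n):
          global tasks, inlined
          if n < 2:
              return n
          key = next(fresh)
          pending[key] = lambda: fib(n - 2)   # expose the child lazily: no task yet
          queue.append(key)
          if len(queue) > 3:                  # stand-in for "a processor went idle"
              k = queue.popleft()             # steal the oldest exposed thunk
              stolen[k] = pending.pop(k)()    # a real runtime runs this elsewhere
              tasks += 1
          a = fib(n - 1)
          if key in stolen:                   # our child was promoted to a task
              b = stolen.pop(key)
          else:                               # common case: inline it, no task ever
              queue.remove(key)
              b = pending.pop(key)()
              inlined += 1
          return a + b

      print(fib(20), "tasks created:", tasks, "calls inlined:", inlined)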

  11. PIPS-SBB: A Parallel Distributed-Memory Branch-and-Bound Algorithm for Stochastic Mixed-Integer Programs

    DOE PAGES

    Munguia, Lluis-Miquel; Oxberry, Geoffrey; Rajan, Deepak

    2016-05-01

    Stochastic mixed-integer programs (SMIPs) deal with optimization under uncertainty at many levels of the decision-making process. When solved as extensive formulation mixed-integer programs, problem instances can exceed available memory on a single workstation. In order to overcome this limitation, we present PIPS-SBB: a distributed-memory parallel stochastic MIP solver that takes advantage of parallelism at multiple levels of the optimization process. We also show promising results on the SIPLIB benchmark by combining methods known for accelerating Branch and Bound (B&B) methods with new ideas that leverage the structure of SMIPs. Finally, we expect the performance of PIPS-SBB to improve further as more functionality is added in the future.

  12. Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

    NASA Technical Reports Server (NTRS)

    Harper, Richard

    1989-01-01

    In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.

  13. PIPS-SBB: A Parallel Distributed-Memory Branch-and-Bound Algorithm for Stochastic Mixed-Integer Programs

    SciTech Connect

    Munguia, Lluis-Miquel; Oxberry, Geoffrey; Rajan, Deepak

    2016-05-01

    Stochastic mixed-integer programs (SMIPs) deal with optimization under uncertainty at many levels of the decision-making process. When solved as extensive formulation mixed-integer programs, problem instances can exceed available memory on a single workstation. In order to overcome this limitation, we present PIPS-SBB: a distributed-memory parallel stochastic MIP solver that takes advantage of parallelism at multiple levels of the optimization process. We also show promising results on the SIPLIB benchmark by combining methods known for accelerating Branch and Bound (B&B) methods with new ideas that leverage the structure of SMIPs. Finally, we expect the performance of PIPS-SBB to improve further as more functionality is added in the future.

  15. ParaHaplo: A program package for haplotype-based whole-genome association study using parallel computing.

    PubMed

    Misawa, Kazuharu; Kamatani, Naoyuki

    2009-10-21

    Since more than a million single-nucleotide polymorphisms (SNPs) are analyzed in any given genome-wide association study (GWAS), performing multiple comparisons can be problematic. To cope with multiple-comparison problems in GWAS, haplotype-based algorithms were developed to correct for multiple comparisons at multiple SNP loci in linkage disequilibrium. A permutation test can also control problems inherent in multiple testing; however, both the calculation of exact probability and the execution of permutation tests are time-consuming. Faster methods for calculating exact probabilities and executing permutation tests are required. We developed a set of computer programs for the parallel computation of accurate P-values in haplotype-based GWAS. Our program, ParaHaplo, is intended for workstation clusters using the Intel Message Passing Interface (MPI). We compared the performance of our algorithm to that of the regular permutation test on JPT and CHB of HapMap. ParaHaplo can detect smaller differences between 2 populations than SNP-based GWAS. We also found that parallel-computing techniques made ParaHaplo 100-fold faster than a non-parallel version of the program. ParaHaplo is a useful tool in conducting haplotype-based GWAS. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computing--such as that used in ParaHaplo--will become increasingly important. The executable binaries and program sources of ParaHaplo are available at the following address: http://sourceforge.jp/projects/parallelgwas/?_sl=1.
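
    Purely as an illustration of the kind of work ParaHaplo parallelizes (ParaHaplo itself is MPI-based and operates on haplotypes, not this toy single-SNP statistic), here is a Python sketch that splits permutations across processes with multiprocessing; all data are synthetic.

      import numpy as np
      from multiprocessing import Pool

      rng = np.random.default_rng(0)
      genotypes = rng.integers(0, 3, size=500)    # 0/1/2 minor-allele counts
      labels = rng.integers(0, 2, size=500)       # 0 = control, 1 = case

      def stat(lab):
          return genotypes[lab == 1].mean() - genotypes[lab == 0].mean()

      def batch(args):                            # one worker's share of permutations
          seed, n = args
          r = np.random.default_rng(seed)
          return [abs(stat(r.permutation(labels))) for _ in range(n)]

      if __name__ == "__main__":
          observed = abs(stat(labels))
          with Pool(4) as pool:                   # 4 workers x 2500 permutations
              null = np.concatenate(pool.map(batch, [(s, 2500) for s in range(4)]))
          p = (1 + (null >= observed).sum()) / (1 + len(null))
          print("permutation P-value:", p)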

  16. CLUSTEREASY: A program for lattice simulations of scalar fields in an expanding universe on parallel computing clusters

    NASA Astrophysics Data System (ADS)

    Felder, Gary

    2008-10-01

    We describe an MPI C++ program that we have written and made available for calculating the evolution of interacting scalar fields in an expanding universe on parallel clusters. The program is a parallel programming extension of the simulation program LATTICEEASY. The ability to run these simulations on parallel clusters, however, greatly extends the range of scales and times that can be simulated. The program is particularly useful for the study of reheating and thermalization after inflation. The program and its full documentation are available on the Web at http://www.science.smith.edu/departments/Physics/fstaff/gfelder/latticeeasy/. In this paper we provide a brief overview of what the program does and what it is useful for. Catalogue identifier: AEBJ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEBJ_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 7469 No. of bytes in distributed program, including test data, etc.: 613 334 Distribution format: tar.gz Programming language: C++/MPI Computer: Cluster. Must have the library FFTW installed Operating system: Any RAM: Typically 4 MB to 1 GB per processor Classification: 1.9 External routines: A single-precision version of the FFTW library (http://www.fftw.org/) must be available on the target machine. Nature of problem: After inflation the universe consisted of interacting fields in a high energy, nonthermal state [1]. The evolution of these fields cannot be described with standard approximation techniques such as linearization, kinetic theory, or Hartree expansion, and must thus be simulated numerically. Fortunately, the fields rapidly acquire large occupation numbers over a range of frequencies, so their evolution can be accurately modeled with classical field theory [2]. The specific fields and

  17. Scalable parallel programming for high performance seismic simulation on petascale heterogeneous supercomputers

    NASA Astrophysics Data System (ADS)

    Zhou, Jun

    The 1994 Northridge earthquake in Los Angeles, California, killed 57 people, injured over 8,700, and caused an estimated $20 billion in damage. Petascale simulations are needed in California and elsewhere to provide society with a better understanding of the rupture and wave dynamics of the largest earthquakes at the shaking frequencies required to engineer safe structures. As heterogeneous supercomputing infrastructures become more common, numerical developments in earthquake system research are particularly challenged by the dependence on accelerator elements to enable "the Big One" simulations with higher frequency and finer resolution. Reducing time to solution and power consumption are two primary focus areas today for the enabling technology of fault rupture dynamics and seismic wave propagation in realistic 3D models of the crust's heterogeneous structure. This dissertation presents scalable parallel programming techniques for high performance seismic simulation running on petascale heterogeneous supercomputers. A real-world earthquake simulation code, AWP-ODC, one of the most advanced earthquake codes to date, was chosen as the base code in this research, and the testbed is based on Titan at Oak Ridge National Laboratory, the world's largest heterogeneous supercomputer. The research work is primarily related to architecture study, computation performance tuning, and software system scalability. An earthquake simulation workflow has also been developed to support efficient production sets of simulations. The highlights of the technical development are an aggressive performance optimization focusing on data locality and a notable data communication model that hides the data communication latency. This development results in optimal computation efficiency and throughput for the 13-point stencil code on heterogeneous systems, which can be extended to general high-order stencil codes. Started from scratch, the hybrid CPU/GPU version of AWP

  18. Parallel Object Oriented MD Simulation Program for Long Time Simulations of Metallic Glasses and Undercooled Liquids

    NASA Astrophysics Data System (ADS)

    Böddeker, B.; Teichler, H.

    The MD simulation program TABB is motivated by the need for long-time simulations in the investigation of slow processes near the glass transition of glass-forming alloys. TABB is written in C++ with a high degree of flexibility: TABB allows the use of any short-ranged pair potentials or EAM potentials, by generating and using a spline representation of all functions and their derivatives. TABB supports several numerical integration algorithms, such as the Runge-Kutta or the modified Gear predictor-corrector algorithm of order five. The boundary conditions can be chosen to resemble the geometry of bulk materials or films. The simulation box length or the pressure can be fixed for each dimension separately. TABB may be used in isokinetic, isoenergetic, or canonical (with random forces) mode. TABB contains a simple instruction interpreter to easily control the parameters and options during the simulation. The same source code can be compiled either for workstations or for parallel computers. The main optimization goal of TABB is to allow long-time simulations of medium or small sized systems. To make this possible, much attention is paid to optimizing the communication between the nodes. TABB uses a domain decomposition procedure. To use many nodes with a small system, the domain size has to be small compared to the range of particle interactions. In the limit of many nodes for only a few atoms, the bottleneck of communication is the latency time. TABB minimizes the number of pairs of domains containing atoms that interact between these domains. This procedure minimizes the need for communication calls between pairs of nodes. TABB decides automatically to how many, and in which, directions the decomposition shall be applied. E.g., in the case of one-dimensional domain decomposition, the simulation box is only split into "slabs" along a selected direction. The three-dimensional domain decomposition is best with respect to the number of interacting domains only for simulations

  19. Solar cell welded interconnection development program. [parallel gap and ultrasonic metal-metal bonding

    NASA Technical Reports Server (NTRS)

    Katzeff, J. S.

    1974-01-01

    Parallel gap welding and ultrasonic bonding techniques were developed for joining selected interconnect materials (silver, aluminum, copper, silver-plated molybdenum, and Kovar) to silver-titanium and aluminum contact cells. All process variables were evaluated, leading to the establishment of optimum solar cell, interconnect, electrode, and equipment criteria for obtaining consistently high-quality welds. The applicability of nondestructive testing of solar cell welds has been studied. A pre-weld monitoring system is being built and will be utilized in the numerically controlled parallel gap weld station.

  20. Parallel implementation of inverse adding-doubling and Monte Carlo multi-layered programs for high performance computing systems with shared and distributed memory

    NASA Astrophysics Data System (ADS)

    Chugunov, Svyatoslav; Li, Changying

    2015-09-01

    Parallel implementations of two numerical tools popular in optical studies of biological materials - the Inverse Adding-Doubling (IAD) program and the Monte Carlo Multi-Layered (MCML) program - were developed and tested in this study. The implementation was based on the Message Passing Interface (MPI) and standard C. Parallel versions of the IAD and MCML programs were compared to their sequential counterparts in validation and performance tests. Additionally, the portability of the programs was tested using a local high performance computing (HPC) cluster, the Penguin-On-Demand HPC cluster, and an Amazon EC2 cluster. Parallel IAD was tested with up to 150 parallel cores using 1223 input datasets. It demonstrated linear scalability, with speedup proportional to the number of parallel cores (up to 150x). Parallel MCML was tested with up to 1001 parallel cores using problem sizes of 10^4-10^9 photon packets. It demonstrated classical performance curves featuring a communication overhead and a performance saturation point. An optimal performance curve was derived for parallel MCML as a function of problem size. The typical speedup achieved for parallel MCML (up to 326x) demonstrated a linear increase with problem size. The precision of MCML results was estimated in a series of tests: a problem size of 10^6 photon packets was found optimal for calculations of total optical response, and 10^8 photon packets for spatially resolved results. The presented parallel versions of the MCML and IAD programs are portable across multiple computing platforms. The parallel programs could significantly speed up simulations for scientists and be utilized to their full potential in computing systems that are readily available without additional costs.
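
    Because photon packets are independent, the parallelization pattern follows a classic Monte Carlo shape, sketched below in MPI C. Here trace_photon is a trivial stand-in for the real MCML kernel, and the per-rank seeding is illustrative only:

      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      static unsigned lcg(unsigned *s) /* tiny RNG to keep the sketch self-contained */
      {
          return *s = *s * 1664525u + 1013904223u;
      }

      static void trace_photon(double *tally, int nbins, unsigned *seed)
      {
          tally[lcg(seed) % (unsigned)nbins] += 1.0; /* stand-in kernel */
      }

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const long n_photons = 1000000;   /* total problem size */
          const int nbins = 512;            /* absorption grid    */
          long share = n_photons / size + (rank < n_photons % size ? 1 : 0);

          double *local = calloc(nbins, sizeof *local);
          double *global = rank == 0 ? calloc(nbins, sizeof *global) : NULL;
          unsigned seed = 12345u + 977u * (unsigned)rank; /* per-rank stream */

          for (long i = 0; i < share; i++)
              trace_photon(local, nbins, &seed);

          /* combine the per-rank tallies on rank 0 */
          MPI_Reduce(local, global, nbins, MPI_DOUBLE, MPI_SUM, 0,
                     MPI_COMM_WORLD);
          if (rank == 0)
              printf("bin 0: %g\n", global[0]);
          MPI_Finalize();
          return 0;
      }

    The saturation point the authors observe appears when the per-rank share becomes so small that startup and reduction costs dominate the tracing time, which is why the optimal core count grows with problem size.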

  1. Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2015-04-01

    PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling being the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement, either the processors are divided out among the newly generated tasks (number of new tasks < number of processors
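
    In plain MPI terms (PM syntax is not reproduced here), dividing the owned processors among newly generated tasks resembles splitting a communicator; a minimal sketch with a hypothetical task count:

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const int ntasks = 4;       /* tasks created by a parallel statement */
          int task = rank % ntasks;   /* deal processors out to the tasks      */

          MPI_Comm task_comm;         /* the processors owned by this task     */
          MPI_Comm_split(MPI_COMM_WORLD, task, rank, &task_comm);

          int trank, tsize;
          MPI_Comm_rank(task_comm, &trank);
          MPI_Comm_size(task_comm, &tsize);
          printf("world rank %d -> task %d (rank %d of %d)\n",
                 rank, task, trank, tsize);

          MPI_Comm_free(&task_comm);
          MPI_Finalize();
          return 0;
      }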

  2. Strength analysis of parallel robot components in PLM Siemens NX 8.5 program

    NASA Astrophysics Data System (ADS)

    Ociepka, P.; Herbus, K.

    2015-11-01

    This article presents a series of numerical analyses carried out to identify the states of stress that arise in the elements of a mechanism during its operation. The object of the research was a parallel robot, which is the basis for the prototype of a driving simulator. The Motion Simulation module and the RecurDyn solver were used to conduct the dynamic analysis. In this module, the joints which occur in the mechanism of a parallel robot were created. Next, dynamic analyses were performed to determine the maximal forces that would be applied to the analyzed elements. The platform motion was also analyzed while simulating a collision of a car with a wall. In the next step, based on the results obtained in the dynamic analysis, the strength analyses were performed in the Advanced Simulation module. The NX Nastran solver was used for the calculations.

  3. Application Portable Parallel Library

    NASA Technical Reports Server (NTRS)

    Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott

    1995-01-01

    The Application Portable Parallel Library (APPL) computer program is a subroutine-based message-passing software library intended to provide a consistent interface to a variety of multiprocessor computers on the market today. It minimizes the effort needed to move an application program from one computer to another: the user develops the application program once and then easily moves it from the parallel computer on which it was created to another parallel computer ("parallel computer" here also includes a heterogeneous collection of networked computers). APPL is written in the C language, with one FORTRAN 77 subroutine for UNIX-based computers, and is callable from application programs written in C or FORTRAN 77.

  5. Temperature Control with Two Parallel Small Loop Heat Pipes for GLM Program

    NASA Technical Reports Server (NTRS)

    Khrustalev, Dmitry; Stouffer, Chuck; Ku, Jentung; Hamilton, Jon; Anderson, Mark

    2014-01-01

    The concept of temperature control of an electronic component using a single Loop Heat Pipe (LHP) is well established for aerospace applications. Using two LHPs is often desirable for redundancy/reliability reasons or for increasing the overall heat source-sink thermal conductance. This effort elaborates on the temperature-controlling operation of a thermal system that includes two small ammonia LHPs thermally coupled together at the evaporator end as well as at the condenser end and operating "in parallel". A transient model of the LHP system was developed on the Thermal Desktop (trademark) platform to understand some fundamental details of such parallel operation of the two LHPs. Extensive thermal-vacuum testing was conducted with the two thermally coupled LHPs operating simultaneously as well as with only one LHP operating at a time. This paper outlines the temperature control procedures for two LHPs operating simultaneously with widely varying sink temperatures. The test data obtained during the thermal-vacuum testing, with both LHPs running simultaneously in comparison with only one LHP operating at a time, are presented with detailed explanations.

  6. A Study on the Effect of Communication Performance on Message-Passing Parallel Programs: Methodology and Case Studies

    NASA Technical Reports Server (NTRS)

    Sarukkai, Sekhar R.; Yan, Jerry; Woodrow, Thomas (Technical Monitor)

    1994-01-01

    From a source-program perspective, the performance achieved on distributed/parallel systems is governed by the underlying message-passing library overhead and the network capabilities of the architecture. Studying the impact of changes in these features on the source program can have a significant influence on the development of next-generation system designs. In this paper we introduce a simple and robust tool that can be used for this purpose. The tool is based on event-driven simulation of programs that generates a new set of trace events - preserving causality and partial order - corresponding to the expected execution of the program in the simulated environment. Trace events can be visualized, and source-level profile information can be used to pinpoint the locations of the program most significantly affected by changing system parameters in the simulated environment. We present a number of examples from the NAS benchmark suite, executed on the Intel Paragon and iPSC/860, that are used to identify and expose performance bottlenecks under varying system parameters. Specific aspects of the system that significantly affect these benchmarks are presented and discussed.

  7. Neurite, a Finite Difference Large Scale Parallel Program for the Simulation of Electrical Signal Propagation in Neurites under Mechanical Loading

    PubMed Central

    García-Grajales, Julián A.; Rucabado, Gabriel; García-Dopico, Antonio; Peña, José-María; Jérusalem, Antoine

    2015-01-01

    With the growing body of research on traumatic brain injury and spinal cord injury, computational neuroscience has recently focused its modeling efforts on neuronal functional deficits following mechanical loading. However, in most of these efforts, cell damage is generally characterized only by purely mechanistic criteria: functions of quantities such as stress, strain or their corresponding rates. The modeling of functional deficits in neurites as a consequence of macroscopic mechanical insults has rarely been explored. In particular, a quantitative mechanically based model of electrophysiological impairment in neuronal cells, Neurite, has only very recently been proposed. In this paper, we present the implementation details of this model: a finite difference parallel program for simulating electrical signal propagation along neurites under mechanical loading. Following the application of a macroscopic strain at a given strain rate produced by a mechanical insult, Neurite is able to simulate the resulting neuronal electrical signal propagation, and thus the corresponding functional deficits. The simulation of the coupled mechanical and electrophysiological behaviors requires computationally expensive calculations that increase in complexity as the network of simulated cells grows. The solvers implemented in Neurite (explicit and implicit) were therefore parallelized using graphics processing units in order to reduce the simulation costs of large-scale scenarios. Cable Theory and Hodgkin-Huxley models were implemented to account for the electrophysiologically passive and active regions of a neurite, respectively, whereas a coupled mechanical model accounting for the neurite's mechanical behavior within its surrounding medium was adopted as the link between electrophysiology and mechanics. This paper provides the details of the parallel implementation of Neurite, along with three different application examples: a long myelinated axon, a segmented
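
    For the Cable Theory part, a minimal explicit finite-difference sketch (a passive cable with a leak term; the Hodgkin-Huxley active currents of the full model are omitted and all parameter values are placeholders). Each grid point updates independently of the others within a step, which is what makes the explicit solver amenable to GPU parallelization:

      #include <stdio.h>
      #include <string.h>

      #define N 100  /* compartments along the neurite */

      /* One explicit step of dV/dt = D * d2V/dx2 - V/tau with sealed
         (reflecting) ends; stable for dt <= dx*dx / (2*D). */
      void cable_step(double *v, double dt, double dx, double D, double tau)
      {
          double vn[N];
          for (int i = 0; i < N; i++) {
              double vl = v[i > 0 ? i - 1 : 1];
              double vr = v[i < N - 1 ? i + 1 : N - 2];
              double d2v = (vl - 2.0 * v[i] + vr) / (dx * dx);
              vn[i] = v[i] + dt * (D * d2v - v[i] / tau);
          }
          memcpy(v, vn, sizeof vn);
      }

      int main(void)
      {
          double v[N] = {0};
          v[N / 2] = 1.0;                       /* initial depolarization */
          for (int t = 0; t < 1000; t++)
              cable_step(v, 1e-3, 0.1, 1.0, 10.0);
          printf("V at center: %g\n", v[N / 2]);
          return 0;
      }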

  8. Generating local addresses and communication sets for data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that, for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution and a computation involving the regular section A(l:h:s), the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little run-time overhead and acceptable preprocessing time.
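
    The brute-force counterpart of this result is easy to state: scan the global indices of A(l:h:s) and keep those owned by a given processor. The paper's contribution is generating the same local sequence without the scan, via the k-state machine. A sketch with zero-based indices:

      #include <stdio.h>

      /* Local addresses touched by processor `me` for the regular section
         A(l:h:s) when A is distributed cyclic(k) over p processors. */
      void local_accesses(int l, int h, int s, int k, int p, int me)
      {
          for (int g = l; g <= h; g += s) {
              int block = g / k;                /* which size-k block */
              if (block % p == me) {            /* cyclic(k) owner    */
                  int local = (block / p) * k + g % k;
                  printf("global %d -> local %d\n", g, local);
              }
          }
      }

      int main(void)
      {
          local_accesses(3, 40, 4, 5, 3, 1);    /* A(3:40:4), cyclic(5), p=3 */
          return 0;
      }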

  10. Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Das, Sajal K.; Harvey, Daniel; Oliker, Leonid

    1999-01-01

    The ability to dynamically adapt an unstructured grid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the viewpoint of portability on various multiprocessor platforms. We address this problem by developing PLUM, an automatic and architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with the goal of providing a global view of system loads across processors. Experiments on an SP2 and an Origin2000 demonstrate the portability of our approach, which achieves superb load balance at the cost of minimal extra overhead.

  11. Parallel programming of gradient-based iterative image reconstruction schemes for optical tomography.

    PubMed

    Hielscher, Andreas H; Bartel, Sebastian

    2004-02-01

    Optical tomography (OT) is a fast-developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of optical properties inside the human body. A major challenge remains the time-consuming, computationally intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of the iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performance are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems.

  12. A parallel program for numerical simulation of discrete fracture network and groundwater flow

    NASA Astrophysics Data System (ADS)

    Huang, Ting-Wei; Liou, Tai-Sheng; Kalatehjari, Roohollah

    2017-04-01

    The ability to model fluid flow in a Discrete Fracture Network (DFN) is critical to various applications such as the exploration of reserves in geothermal and petroleum reservoirs, geological sequestration of carbon dioxide, and final disposal of spent nuclear fuel. Although several commercial or academic DFN flow simulators are already available (e.g., FracMan and DFNWORKS), challenges in terms of computational efficiency and three-dimensional visualization remain, which motivates this study to develop a new DFN and flow simulator. The new simulator, DFNbox, was written in C++ under the cross-platform software development framework provided by Qt. DFNbox integrates the following capabilities into a user-friendly drop-down menu interface: DFN simulation and clipping, 3D mesh generation, fracture data analysis, connectivity analysis, flow path analysis and steady-state groundwater flow simulation. All three-dimensional visualization graphics were developed using the free OpenGL API. As in other DFN simulators, fractures are conceptualized as a random point process in space, with stochastic characteristics represented by orientation, size, transmissivity and aperture. Fracture meshing was implemented by Delaunay triangulation for visualization, but not for flow simulation purposes. The boundary element method was used for flow simulation, such that only the unknown head or flux along exterior and intersection boundaries is needed to solve the flow field in the DFN. Parallel computation was taken into account in developing DFNbox wherever possible. For example, the time-consuming sequential code for fracture clipping calculations has been completely replaced by a highly efficient parallel one. This can greatly enhance computational efficiency, especially on multi-thread platforms. Furthermore, DFNbox has been tested successfully on Windows and Linux systems with equally good performance.

  13. Parallel and Distributed Computing.

    DTIC Science & Technology

    1986-12-12

    program was devoted to parallel and distributed computing. Support for this part of the program was obtained from the present Army contract and a... Umesh Vazirani. A workshop on parallel and distributed computing was held from May 19 to May 23, 1986 and drew 141 participants. Keywords: Mathematical programming; Protocols; Randomized algorithms. (Author)

  14. Parallel Debugging Using Graphical Views

    DTIC Science & Technology

    1988-03-01

    Voyeur, a prototype system for creating graphical views of parallel programs, provides a cost-effective way to construct such views for any parallel... programming system. We illustrate Voyeur by discussing four views created for debugging Poker programs. One is a general trace facility for any Poker... Graphical views are essential for debugging parallel programs because of the large quantity of state information contained in parallel programs. Voyeur

  15. Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages

    DTIC Science & Technology

    2013-01-02

  16. Acceleration of the Geostatistical Software Library (GSLIB) by code optimization and hybrid parallel programming

    NASA Astrophysics Data System (ADS)

    Peredo, Oscar; Ortiz, Julián M.; Herrero, José R.

    2015-12-01

    The Geostatistical Software Library (GSLIB) has been used in the geostatistical community for more than thirty years. It was designed as a bundle of sequential Fortran codes, and today it is still in use by many practitioners and researchers. Despite its widespread use, few attempts have been reported to bring this package into the multi-core era. Using all CPU resources, GSLIB algorithms can handle large datasets and grids, whose tasks are compute- and memory-intensive. In this work, a methodology is presented to accelerate GSLIB applications using code optimization and hybrid parallel processing, specifically for compute-intensive applications. Minimal code modifications are added, decreasing the elapsed execution time of the studied routines as much as possible. If multi-core processing is available, the user can activate OpenMP directives to speed up the execution using all resources of the CPU. If multi-node processing is available, the execution is enhanced using MPI messages between the compute nodes. Four case studies are presented: experimental variogram calculation, kriging estimation, and sequential Gaussian and indicator simulation. For each application, three scenarios (small, large and extra large) are tested using a desktop environment with 4 CPU cores and a multi-node server with 128 CPU nodes. Elapsed times, speedup and efficiency results are shown.
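
    The flavor of modification described here, sketched on a simplified 1D experimental semivariogram (the GSLIB routine is far more general): a single directive parallelizes the hot loop when OpenMP is enabled, and the code compiles to its serial form otherwise.

      #include <stdio.h>

      /* gamma(h) = sum((z[i+h] - z[i])^2) / (2 * npairs) on a regular grid.
         Build with -fopenmp (or the compiler's equivalent) to enable the
         directive; without it the pragma is simply ignored. */
      double semivariogram(const double *z, int n, int lag)
      {
          double acc = 0.0;
          int npairs = n - lag;
          #pragma omp parallel for reduction(+ : acc)
          for (int i = 0; i < npairs; i++) {
              double d = z[i + lag] - z[i];
              acc += d * d;
          }
          return acc / (2.0 * npairs);
      }

      int main(void)
      {
          double z[8] = {1, 2, 4, 3, 5, 4, 6, 5};
          printf("gamma(1) = %g\n", semivariogram(z, 8, 1));
          return 0;
      }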

  17. What Multilevel Parallel Programs do when you are not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    With the current trend in parallel computer architectures towards clusters of shared-memory symmetric multi-processors, parallel programming techniques have evolved to support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation-specific behavior of the operating system (OS) or architectural issues. Rewriting a large scientific application to employ a new programming paradigm is usually a time-consuming and error-prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse-grained parallelization and OpenMP [9] for fine-grained loop-level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed-memory architectures but is also suitable for shared-memory systems. The second paradigm under consideration is MLP, which was developed by Taft. The approach is similar to MPI/OpenMP, using a mix of coarse-grain process-level parallelization and loop-level OpenMP parallelization. As is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
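
    Reduced to a minimal sketch, the hybrid paradigm looks as follows: MPI ranks own coarse-grained slices of the data, and an OpenMP directive parallelizes the fine-grained loop within each rank (the slicing below assumes the rank count divides N):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int provided;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          enum { N = 1 << 20 };
          static double x[N];
          int lo = rank * (N / size), hi = lo + N / size; /* coarse level */
          for (int i = lo; i < hi; i++)
              x[i] = 1.0;

          double local = 0.0, total = 0.0;
          #pragma omp parallel for reduction(+ : local)   /* fine level   */
          for (int i = lo; i < hi; i++)
              local += x[i];

          MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                        MPI_COMM_WORLD);
          if (rank == 0)
              printf("total = %g\n", total);
          MPI_Finalize();
          return 0;
      }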

  18. Parallel embryonic transcriptional programs evolve under distinct constraints and may enable morphological conservation amidst adaptation.

    PubMed

    Malik, Assaf; Gildor, Tsvia; Sher, Noa; Layous, Majed; Ben-Tabou de-Leon, Smadar

    2017-10-01

    Embryonic development evolves by balancing stringent morphological constraints with genetic and environmental variation. The design principle that allows developmental transcriptional programs to conserve embryonic morphology while adapting to environmental changes is still not fully understood. To address this fundamental challenge, we compare the developmental transcriptomes of two sea urchin species, Paracentrotus lividus and Strongylocentrotus purpuratus, that shared a common ancestor about 40 million years ago and are geographically distant yet show similar morphology. We find that both developmental and housekeeping genes show highly dynamic and strongly conserved temporal expression patterns. The expression of other gene sets, including homeostasis and response genes, shows divergent expression, which could result from either evolutionary drift or adaptation to local environmental conditions. The interspecies correlations of developmental gene expression are highest between morphologically similar developmental time points, whereas the interspecies correlations of housekeeping gene expression are high between all the late zygotic time points. Relatedly, the position of the phylotypic stage varies between these two groups of genes: developmental gene expression shows the highest conservation at the mid-developmental stage, in agreement with the hourglass model, while the conservation of housekeeping genes keeps increasing with developmental time. When all genes are combined, the relationship between conservation of gene expression and morphological similarity is partially masked by housekeeping genes and genes with diverged expression. Our study illustrates various transcriptional programs that coexist in the developing embryo and evolve under different constraints. Apparently, morphological constraints underlie the conservation of developmental gene expression, while embryonic fitness requires the conservation of housekeeping gene expression and the species

  19. Projection-free parallel quadratic programming for linear model predictive control

    NASA Astrophysics Data System (ADS)

    Di Cairano, S.; Brand, M.; Bortoff, S. A.

    2013-08-01

    A key component in enabling the application of model predictive control (MPC) in fields such as automotive, aerospace, and factory automation is the availability of low-complexity, fast optimisation algorithms that solve the MPC finite-horizon optimal control problem on architectures with reduced computational capabilities. In this paper, we introduce a projection-free iterative optimisation algorithm and discuss its application to linear MPC. The algorithm, originally developed by Brand for non-negative quadratic programs, is based on a multiplicative update rule and is shown to converge to a fixed point which is the optimum. An acceleration technique based on a projection-free line search is also introduced to speed up the convergence to the optimum. The algorithm is applied to MPC through the dual of the quadratic program (QP) formulated from the MPC finite-time optimal control problem. We discuss how termination conditions with a guaranteed degree of suboptimality can be enforced, and how the algorithm performance can be optimised by pre-computing the matrices in a parametric form. We show computational results of the algorithm in three common case studies and compare these results with those obtained by other available free and commercial QP solvers.
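
    The core idea can be sketched in a few lines. The code below is a generic member of the multiplicative-update family for non-negative QPs, with made-up problem data; the paper's specific algorithm, dual formulation, and line-search acceleration are not reproduced here. For min 0.5 x'Qx + c'x subject to x >= 0, split Q = Qp - Qm and c = cp - cm into nonnegative parts and iterate x_i <- x_i * (Qm x + cm)_i / (Qp x + cp)_i; every iterate stays nonnegative without a projection step, and the update is fully parallel across components:

      #include <math.h>
      #include <stdio.h>

      #define N 3

      void pqp_solve(const double Q[N][N], const double c[N], double x[N],
                     int iters)
      {
          for (int it = 0; it < iters; it++) {
              double num[N], den[N];
              for (int i = 0; i < N; i++) {
                  num[i] = fmax(-c[i], 0.0);                    /* cm   */
                  den[i] = fmax(c[i], 0.0);                     /* cp   */
                  for (int j = 0; j < N; j++) {
                      if (Q[i][j] >= 0.0) den[i] += Q[i][j] * x[j]; /* Qp x */
                      else                num[i] -= Q[i][j] * x[j]; /* Qm x */
                  }
              }
              for (int i = 0; i < N; i++)
                  x[i] *= num[i] / (den[i] + 1e-12);  /* stays >= 0 */
          }
      }

      int main(void)
      {
          double Q[N][N] = {{4, 1, 0}, {1, 3, -1}, {0, -1, 2}};
          double c[N] = {-1.0, -2.0, 0.5};
          double x[N] = {1.0, 1.0, 1.0};  /* strictly positive start */
          pqp_solve(Q, c, x, 500);
          printf("x = %.4f %.4f %.4f\n", x[0], x[1], x[2]);
          return 0;
      }

    At a fixed point with x_i > 0 the numerator equals the denominator, i.e. (Qx + c)_i = 0, which is exactly the stationarity condition; components driven to zero satisfy the complementary KKT condition.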

  20. Parallel Lisp simulator

    SciTech Connect

    Weening, J.S.

    1988-05-01

    CSIM is a simulator for parallel Lisp, based on a continuation passing interpreter. It models a shared-memory multiprocessor executing programs written in Common Lisp, extended with several primitives for creating and controlling processes. This paper describes the structure of the simulator, measures its performance, and gives an example of its use with a parallel Lisp program.

  1. Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures

    PubMed Central

    Suchard, Marc A.; Wang, Quanli; Chan, Cliburn; Frelinger, Jacob; Cron, Andrew; West, Mike

    2010-01-01

    This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies using high-throughput technologies that generate many very large datasets and require increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in the ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms and software design can lead to vast speed-ups and, critically, enable statistical analyses that presently would not be performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context. PMID:20877443

  2. NASP: a parallel program for identifying evolutionarily conserved nucleic acid secondary structures from nucleotide sequence alignments.

    PubMed

    Semegni, J Y; Wamalwa, M; Gaujoux, R; Harkins, G W; Gray, A; Martin, D P

    2011-09-01

    Many natural nucleic acid sequences have evolutionarily conserved secondary structures with diverse biological functions. A reliable computational tool for identifying such structures would be very useful in guiding experimental analyses of their biological functions. NASP (Nucleic Acid Structure Predictor) is a program that takes into account thermodynamic stability, Boltzmann base pair probabilities, alignment uncertainty, covarying sites and evolutionary conservation to identify biologically relevant secondary structures within multiple sequence alignments. Unique to NASP is the consideration of all this information together with a recursive permutation-based approach to progressively identify and list the most conserved probable secondary structures that are likely to have the greatest biological relevance. By focusing on identifying only evolutionarily conserved structures, NASP forgoes the prediction of complete nucleotide folds but outperforms various other secondary structure prediction methods in its ability to selectively identify actual base pairings. Downloadable and web-based versions of NASP are freely available at http://web.cbio.uct.ac.za/~yves/nasp_portal.php. Contact: yves@cbio.uct.ac.za. Supplementary data are available at Bioinformatics online.

  3. Center for Programming Models for Scalable Parallel Computing - Towards Enhancing OpenMP for Manycore and Heterogeneous Nodes

    SciTech Connect

    Barbara Chapman

    2012-02-01

    OpenMP was not well recognized at the beginning of the project, around 2003, because of its limited use in DoE production applications and the immature hardware support for an efficient implementation. Yet in recent years, it has gradually been adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We observed this trend and worked diligently to improve our OpenMP compiler and runtimes, and with the OpenMP standard organization, to make sure OpenMP evolves in a direction aligned with DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed-memory systems by exploring different programming models, language extensions, compiler optimizations, and runtime library support.

  4. Introducing PROFESS 2.0: A parallelized, fully linear scaling program for orbital-free density functional theory calculations

    NASA Astrophysics Data System (ADS)

    Hung, Linda; Huang, Chen; Shin, Ilgyou; Ho, Gregory S.; Lignères, Vincent L.; Carter, Emily A.

    2010-12-01

    Orbital-free density functional theory (OFDFT) is a first-principles quantum mechanics method that finds the ground-state energy of a system by variationally minimizing with respect to the electron density. No orbitals are used in the evaluation of the kinetic energy (unlike Kohn-Sham DFT), and the method scales nearly linearly with the size of the system. The PRinceton Orbital-Free Electronic Structure Software (PROFESS) uses OFDFT to model materials from the atomic scale to the mesoscale. This new version of PROFESS allows the study of larger systems through two significant changes: PROFESS is now parallelized, and the ion-electron and ion-ion terms scale quasilinearly, instead of quadratically as in PROFESS v1 (L. Hung and E.A. Carter, Chem. Phys. Lett. 475 (2009) 163). At the start of a run, PROFESS reads the various input files that describe the geometry of the system (ion positions and cell dimensions), the type of elements (defined by electron-ion pseudopotentials), the actions to perform (minimize with respect to electron density and/or ion positions and/or cell lattice vectors), and the various options for the computation (such as which functionals to use). Based on these inputs, PROFESS sets up a computation and performs the appropriate optimizations. Energies, forces, stresses, material geometries, and electron density configurations are some of the values that can be output throughout the optimization. New version program summary - Program title: PROFESS; Catalogue identifier: AEBN_v2_0; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEBN_v2_0.html; Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland; Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html; No. of lines in distributed program, including test data, etc.: 68 721; No. of bytes in distributed program, including test data, etc.: 1 708 547; Distribution format: tar.gz; Programming language: Fortran 90; Computer

  5. Parallel Program Systems for the Analysis of Wave Processes in Elastic-Plastic, Granular, Porous and Multi-Blocky Media

    NASA Astrophysics Data System (ADS)

    Sadovskaya, Oxana; Sadovskii, Vladimir

    2017-04-01

    When modeling wave propagation processes in geomaterials (granular and porous media, soils and rocks), it is necessary to take into account the structural inhomogeneity of these materials. Parallel program systems have been worked out for the numerical solution of 2D and 3D problems of the dynamics of deformable media with constitutive relationships of rather general form, on the basis of a universal mathematical model describing small strains of elastic, elastic-plastic, granular and porous materials. In the case of an elastic material, the model reduces to a system of equations, hyperbolic in the sense of Friedrichs, written in terms of velocities and stresses in symmetric form. In the case of an elastic-plastic material, the model is a special formulation of the Prandtl-Reuss theory in the form of a variational inequality with one-sided constraints on the stress tensor. Generalization of the model to describe granularity and the collapse of pores is obtained by means of the rheological approach, taking into account the different resistance of materials to tension and compression. Rotational motion of particles in the material microstructure is considered within the framework of a mathematical model of the Cosserat continuum. The computational domain may have a blocky structure, composed of an arbitrary number of layers, strips in a layer and blocks in a strip, made of different materials with self-consistent curvilinear interfaces. At the external boundaries of the computational domain, the main types of dissipative boundary conditions in terms of velocities, stresses or mixed boundary conditions can be given. A shock-capturing algorithm is proposed for implementation of the model on supercomputers with cluster architecture. It is based on the two-cyclic splitting method with respect to spatial variables and special procedures of stress correction to take into account the plasticity, granularity or porosity of a material. An explicit monotone ENO-scheme is applied for solving one

  6. Remote origins of interannual variability in the Indonesian Throughflow region from data and a global Parallel Ocean Program simulation

    NASA Astrophysics Data System (ADS)

    McClean, Julie L.; Ivanova, Detelina P.; Sprintall, Janet

    2005-10-01

    The mean and interannual variability of the thermal structure of the World Ocean Circulation Experiment (WOCE) repeat IX1-expendable bathythermograph (XBT) transect between Java and Western Australia were compared statistically for the years 1987-1997 with concurrent, co-located output from a global eddy-permitting configuration of the Parallel Ocean Program (POP) model forced with realistic surface fluxes. Dominant variability at long timescales for both model and data in the southern IX1 region was associated with Pacific El Niño-Southern Oscillation (ENSO) events; at the northern end it was due to remote equatorial Indian Ocean forcing and Indian Ocean Dipole Mode events. In the Indo-Pacific domain the model reproduced the structure and magnitude of observed low-frequency variability. Event analyses following the warm ENSO phase showed low-frequency off-equatorial Rossby waves interacting with the North Pacific western maritime boundary to reflect onto the equator and excite a coastally trapped response that propagated through the Indonesian seas and along the northwest coast of Australia. In turn, the signal progressively propagated away from this coast as free baroclinic Rossby waves to 90°E. Cross-spectral analyses confirmed that on interannual timescales, both off-equatorial and equatorial signals remotely forced in the Pacific were largely responsible for the strong observed and modeled variability at the southern end of IX1.

  7. Parallel Logic Programming Architecture

    DTIC Science & Technology

    1990-04-01

  8. Parallel computers

    SciTech Connect

    Treveaven, P.

    1989-01-01

    This book presents an introduction to object-oriented, functional, and logic parallel computing on which the fifth generation of computer systems will be based. Coverage includes concepts for parallel computing languages, a parallel object-oriented system (DOOM) and its language (POOL), an object-oriented multilevel VLSI simulator using POOL, and implementation of lazy functional languages on parallel architectures.

  9. Languages for parallel architectures

    SciTech Connect

    Bakker, J.W.

    1989-01-01

    This book presents mathematical methods for modelling parallel computer architectures, based on the results of ESPRIT project 415 on computer languages for parallel architectures. Presented are investigations incorporating a wide variety of programming styles, including the functional, logic, and object-oriented paradigms. Topics covered include Philips's parallel object-oriented language POOL, lazy functional languages, the languages IDEAL, K-LEAF, FP2, and Petri-net semantics for the AADL language.

  10. Appendix E: Parallel Pascal development system

    NASA Technical Reports Server (NTRS)

    1985-01-01

    The Parallel Pascal Development System enables Parallel Pascal programs to be developed and tested on a conventional computer. It consists of several system programs, including a Parallel Pascal to standard Pascal translator, and a library of Parallel Pascal subprograms. The library includes subprograms for using Parallel Pascal on a parallel system with a fixed degree of parallelism, such as the Massively Parallel Processor, to conveniently manipulate arrays which have larger dimensions than the hardware. Programs can be conveniently tested with small-sized arrays on the conventional computer before attempting a run on a parallel system.

  12. Parallels between Global Transcriptional Programs of Polarizing Caco-2 Intestinal Epithelial Cells In Vitro and Gene Expression Programs in Normal Colon and Colon Cancer

    PubMed Central

    Sääf, Annika M.; Halbleib, Jennifer M.; Chen, Xin; Yuen, Siu Tsan; Leung, Suet Yi

    2007-01-01

    Posttranslational mechanisms are implicated in the development of epithelial cell polarity, but little is known about the patterns of gene expression and transcriptional regulation during this process. We characterized temporal patterns of gene expression during cell–cell adhesion-initiated polarization of cultured human Caco-2 cells, which develop structural and functional polarity resembling enterocytes in vivo. A distinctive switch in gene expression patterns occurred upon formation of cell–cell contacts. Comparison to gene expression patterns in normal human colon and colon tumors revealed that the pattern in proliferating, nonpolarized Caco-2 cells paralleled patterns seen in human colon cancer in vivo, including expression of genes involved in cell proliferation. The pattern switched in polarized Caco-2 cells to one more closely resembling that in normal colon tissue, indicating that regulation of transcription underlying Caco-2 cell polarization is similar to that during enterocyte differentiation in vivo. Surprisingly, the temporal program of gene expression in polarizing Caco-2 cells involved changes in signaling pathways (e.g., Wnt, Hh, BMP, FGF) in patterns similar to those during migration and differentiation of intestinal epithelial cells in vivo, despite the absence of morphogen gradients and interactions with stromal cells characteristic of enterocyte differentiation in situ. The full data set is available at http://microarray-pubs.stanford.edu/CACO2. PMID:17699589

  13. Work stealing for GPU-accelerated parallel programs in a global address space framework

    SciTech Connect

    Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram; Balaji, Pavan; Sadayappan, P.

    2016-01-06

    Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.
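
    Stripped of the global address space and the GPU specifics, the basic work-stealing loop the paper builds on looks as follows (a deliberately simplified lock-based toy with two workers and a fixed task set; production systems use lock-free deques and must handle dynamically spawned tasks):

      #include <pthread.h>
      #include <stdio.h>

      #define NWORKERS 2
      #define NTASKS   16

      typedef struct {
          int tasks[NTASKS];
          int top;
          pthread_mutex_t mu;
      } queue_t;

      static queue_t q[NWORKERS];

      static int pop(queue_t *s) /* task id, or -1 if the queue is empty */
      {
          pthread_mutex_lock(&s->mu);
          int t = s->top > 0 ? s->tasks[--s->top] : -1;
          pthread_mutex_unlock(&s->mu);
          return t;
      }

      static void *worker(void *arg)
      {
          int me = (int)(long)arg;
          for (;;) {
              int t = pop(&q[me]);                       /* own queue     */
              if (t < 0)
                  t = pop(&q[(me + 1) % NWORKERS]);      /* steal         */
              if (t < 0)
                  break; /* nothing anywhere; no new tasks spawn in this toy */
              printf("worker %d runs task %d\n", me, t); /* execute task  */
          }
          return NULL;
      }

      int main(void)
      {
          for (int w = 0; w < NWORKERS; w++) {
              pthread_mutex_init(&q[w].mu, NULL);
              q[w].top = 0;
          }
          for (int i = 0; i < NTASKS; i++)   /* seed all work on worker 0 */
              q[0].tasks[i] = i;
          q[0].top = NTASKS;

          pthread_t th[NWORKERS];
          for (long w = 0; w < NWORKERS; w++)
              pthread_create(&th[w], NULL, worker, (void *)w);
          for (int w = 0; w < NWORKERS; w++)
              pthread_join(th[w], NULL);
          return 0;
      }

    The design questions the paper studies sit on top of this skeleton: whether a CPU or a GPU should execute a stolen task given their different sweet spots in problem size, and whether the data a task needs is worth moving between the CPU and GPU memory domains.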

  14. Managing Algorithmic Skeleton Nesting Requirements in Realistic Image Processing Applications: The Case of the SKiPPER-II Parallel Programming Environment's Operating Model

    NASA Astrophysics Data System (ADS)

    Coudarcher, Rémi; Duculty, Florent; Serot, Jocelyn; Jurie, Frédéric; Derutin, Jean-Pierre; Dhome, Michel

    2005-12-01

    SKiPPER is a SKeleton-based Parallel Programming EnviRonment developed since 1996 at the LASMEA laboratory of Blaise Pascal University, France. The main goal of the project was to demonstrate the applicability of skeleton-based parallel programming techniques to the fast prototyping of reactive vision applications. This paper deals with the special features embedded in the latest version of the project: algorithmic skeleton nesting capabilities and a fully dynamic operating model. Through the case study of a complete and realistic image processing application, in which we have pointed out the requirement for skeleton nesting, we present the operating model of this feature. The work described here is one of the few reported experiments showing the application of skeleton nesting facilities to the parallelisation of a realistic application, especially in the area of image processing. The image processing application we have chosen is an appearance-based 3D face-tracking algorithm.

  15. Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming

    NASA Technical Reports Server (NTRS)

    Gentzsch, W.

    1982-01-01

    Problems which can arise with vector and parallel computers are discussed in a user-oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems' performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

  16. Parallel rendering

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1995-01-01

    This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.

  17. Parallel computation

    NASA Astrophysics Data System (ADS)

    Huberman, Bernardo A.

    1989-11-01

    This paper reviews three different aspects of parallel computation which are useful for physics. The first part deals with special architectures for parallel computing (SIMD and MIMD machines) and their differences, with examples of their uses. The second section discusses the speedup that can be achieved in parallel computation and the constraints generated by the issues of communication and synchrony. The third part describes computation by distributed networks of powerful workstations without global controls and the issues involved in understanding their behavior.

  18. Parallel computing works

    SciTech Connect

    Not Available

    1991-10-23

    An account of the Caltech Concurrent Computation Program (C^3P), a five-year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations? As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C^3P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C^3P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.

  19. A systolic array parallelizing compiler

    SciTech Connect

    Tseng, P.S. )

    1990-01-01

    This book presents a completely new approach to the problem of the systolic array parallelizing compiler. It describes the AL parallelizing compiler for the Warp systolic array, the first working systolic array parallelizing compiler, which can generate efficient parallel code for complete LINPACK routines. The book begins by analyzing the architectural strengths of the Warp systolic array. It proposes a model for mapping programs onto the machine and introduces the notion of data relations for optimizing the program mapping. Also presented are successful applications of the AL compiler in matrix computation and image processing. A complete listing of the source program and compiler-generated parallel code is given to clarify the overall picture of the compiler. The book concludes that a systolic array parallelizing compiler can produce efficient parallel code, almost identical to what the user would have written by hand.

  20. mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OpenMP

    NASA Astrophysics Data System (ADS)

    Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

    2012-02-01

    We have revised the general-purpose parallel molecular dynamics simulation program mm_par using object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark results are presented here. New version program summary - Program title: mm_par2.0; Catalogue identifier: ADXP_v2_0; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html; Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland; Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html; No. of lines in distributed program, including test data, etc.: 2 390 858; No. of bytes in distributed program, including test data, etc.: 25 068 310; Distribution format: tar.gz; Programming language: C++; Computer: Any system operated by Linux or Unix; Operating system: Linux; Classification: 7.7; External routines: Wrappers are provided for the FFTW [1] and Intel MKL [2] FFT routines; the Numerical Recipes [3] FFT, random number generator, and eigenvalue solver routines; the SPRNG [4] random number generator; the Mersenne Twister [5] random number generator; and a space-filling curve routine. Catalogue identifier of previous version: ADXP_v1_0; Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560; Does the new version supersede the previous version?: Yes. Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic to mesoscopic scales. Solution method: Molecular dynamics simulation in the NVE, NVT, and NPT ensembles; Langevin dynamics simulation; dissipative particle dynamics simulation. Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification, and to be better for maintenance. Second, version 1.0 was based on the atom decomposition and domain decomposition schemes [6] for parallelization. However, atom

  1. Hardware-software-co-design of parallel and distributed systems using a behavioural programming and multi-process model with high-level synthesis

    NASA Astrophysics Data System (ADS)

    Bosse, Stefan

    2011-05-01

    A new design methodology for parallel and distributed embedded systems is presented, using the behavioural hardware compiler ConPro, which provides an imperative programming model based on concurrently communicating sequential processes (CSP) with an extensive set of interprocess-communication primitives and guarded atomic actions. The programming language and the compiler-based synthesis process enable the design of constrained power- and resource-aware embedded systems with pure Register-Transfer Logic (RTL) efficiently mapped to FPGA and ASIC technologies. Concurrency is modelled explicitly at the control- and data-path level. Additionally, concurrency at the data-path level can be automatically explored and optimized by different schedulers. The CSP programming model can be synthesized to hardware (SoC) and software (C, ML) models and targets. A common source for both hardware and software implementation with identical functional behaviour is used. Processes and objects of the entire design can be distributed on different hardware and software platforms, for example, several FPGA components and software executed on several microprocessors, providing a parallel and distributed system. Intersystem, interprocess, and object communication is automatically implemented with serial links, not visible at the programming level. The presented design methodology has the benefits of high modularity and freedom of choice of target technologies and system architecture. Algorithms can be well matched to, and distributed on, different suitable execution platforms and implementation technologies, using a unique programming model that provides a balance of concurrency and resource complexity. An extended case study of a communication protocol used in high-density sensor-actuator networks demonstrates and compares the design of hardware and software targets. The communication protocol is suited for high-density intra- and interchip networks.

  2. 4MOST: science operations for a large spectroscopic survey program with multiple science cases executed in parallel

    NASA Astrophysics Data System (ADS)

    Walcher, C. Jakob; de Jong, Roelof S.; Dwelly, Tom; Bellido, Olga; Boller, Thomas; Chiappini, Cristina; Feltzing, Sofia; Irwin, Mike; McMahon, Richard; Merloni, Andrea; Schnurr, Olivier; Walton, Nicholas A.

    2016-07-01

    The 4MOST instrument is a multi-object spectrograph to be mounted on the VISTA telescope at ESO's La Silla Paranal Observatory. 4MOST will deliver several tens of millions of spectra from surveys typically lasting 5 years. 4MOST will address Galactic and extragalactic science cases simultaneously, i.e. by observing targets from a large number of different surveys within one science exposure. This parallel mode of operations, as well as the survey nature of 4MOST, requires some 4MOST-specific operations features within the overall operations model of ESO. These features are necessary to minimize any changes to the ESO operations model at the La Silla Paranal Observatory on the one hand, and to enable parallel science observing, and thus the most efficient use of the instrument, on the other hand. The main feature is that the 4MOST consortium will not only deliver the instrument, but also contractual services to the user community, which is why 4MOST is also described as a 'facility'. We describe the operations model for 4MOST as seen by the consortium building the instrument. Among other things this encompasses: 1) a joint science team for all participating surveys (i.e. including community surveys as well as those from the instrument-building consortium); 2) common centralized tasks in observing preparation and data management, provided as a service by the consortium; 3) transparency of all decisions to all stakeholders; 4) close interaction between science and facility operations. Here we describe our efforts to make the parallel observing mode efficient, flexible, and manageable.

  3. Efficient Parallel All-Electron Four-Component Dirac-Kohn-Sham Program Using a Distributed Matrix Approach II.

    PubMed

    Storchi, Loriano; Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Quiney, Harry M

    2013-12-10

    We propose a new complete memory-distributed algorithm, which significantly improves the parallel implementation of the all-electron four-component Dirac-Kohn-Sham (DKS) module of BERTHA (J. Chem. Theory Comput. 2010, 6, 384). We devised an original procedure for mapping the DKS matrix between an efficient integral-driven distribution, guided by the structure of specific G-spinor basis sets and by density fitting algorithms, and the two-dimensional block-cyclic distribution scheme required by the ScaLAPACK library employed for the linear algebra operations. Because of the efficiency of the memory distribution, this implementation represents a leap forward in the applicability of the DKS procedure to arbitrarily large molecular systems and in its porting to last-generation massively parallel systems. The performance of the code is illustrated by test calculations on several gold clusters of increasing size. The DKS self-consistent procedure has been explicitly converged for two representative clusters, namely Au20 and Au34, for which the density of electronic states is reported and discussed. The largest gold cluster requires more than 39k basis functions and DKS matrices on the order of 23 GB.
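    For readers unfamiliar with the layout mentioned above: a two-dimensional block-cyclic distribution assigns matrix block (bi, bj) to process (bi mod Pr, bj mod Pc) on a Pr x Pc process grid. A minimal sketch of that owner computation, with an illustrative block size and grid shape rather than values from BERTHA:

        # Owner of a global matrix element under a 2D block-cyclic distribution,
        # as used by ScaLAPACK. Block size NB and process grid PR x PC are
        # illustrative parameters, not values from the BERTHA code.
        NB = 64          # square block size
        PR, PC = 4, 8    # process grid: 4 rows x 8 columns

        def owner(i, j):
            """Return the (row, col) coordinates of the process owning A[i, j]."""
            block_row = i // NB
            block_col = j // NB
            return (block_row % PR, block_col % PC)

        # Example: which process holds element (1000, 70000) of a large matrix?
        print(owner(1000, 70000))   # -> (3, 5)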

  4. Parallel Pascal - An extended Pascal for parallel computers

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.

    1984-01-01

    Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

  6. Parallel machines: Parallel machine languages

    SciTech Connect

    Iannucci, R.A. )

    1990-01-01

    This book presents a framework for understanding the tradeoffs between the conventional view and the dataflow view, with the objective of discovering the critical hardware structures which must be present in any scalable, general-purpose parallel computer to effectively tolerate latency and synchronization costs. The author presents an approach to scalable general-purpose parallel computation. Linguistic concerns, compiling issues, intermediate-language issues, and hardware/technological constraints are presented as a combined approach to architectural development. This book presents the notion of a parallel machine language.

  7. Blade row dynamic digital compressor program. Volume 1: J85 clean inlet flow and parallel compressor models

    NASA Technical Reports Server (NTRS)

    Tesch, W. A.; Steenken, W. G.

    1976-01-01

    The results are presented of a one-dimensional dynamic digital blade row compressor model study of a J85-13 engine operating with uniform and with circumferentially distorted inlet flow. Details of the geometry and the derived blade row characteristics used to simulate the clean inlet performance are given. A stability criterion based upon the self-developing unsteady internal flows near surge provided an accurate determination of the clean-inlet surge line. The basic model was modified to include an arbitrary-extent, multi-sector parallel compressor configuration for investigating 180 deg 1/rev total pressure, total temperature, and combined total pressure and total temperature distortions. The combined distortions included opposed, coincident, and 90 deg overlapped patterns. The predicted losses in surge pressure ratio matched the measured data trends at all speeds and gave accurate predictions at high corrected speeds, where the slope of the speed lines approached the vertical.

  8. Six-year cost trends at PPG industries paralleling the introduction of health promotion programs directed at cardiovascular disease prevention.

    PubMed

    Goetzel, Ron Z; Kowlessar, Niranjana M; Henke, Rachel; Benevent, Richele; Tabrizi, Maryam; Colombi, Alberto M

    2013-05-01

    Over the past several years, PPG Industries (PPG) implemented worksite health promotion programs aimed at improving employees' health and reducing overall medical costs as well as those specific to cardiovascular disease. Using medical claims data, we examined trends in these costs among PPG employees for a 6-year period, from 2005 to 2010. Overall medical costs remained relatively flat, increasing by 1.2% compounded annually, unadjusted for inflation, while inflation-adjusted costs declined by 2.9%. Comparing worksites rated "high-high" on both program implementation and leadership support with worksites scoring highly on one or none of those dimensions, the "high-high" group experienced a decreasing cost trend, whereas the "other" group showed an increase. The analysis suggests that PPG's efforts to reconfigure and intensify its wellness program offerings may have resulted in lower health care cost trends.

  9. Final Report, Center for Programming Models for Scalable Parallel Computing: Co-Array Fortran, Grant Number DE-FC02-01ER25505

    SciTech Connect

    Robert W. Numrich

    2008-04-22

    The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard, called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a program to provide access to the co-array model through the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing. A separate collaborative project with LANL and PNNL showed how to extend the
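    The collective operations described above (global sums, minima, maxima) are Fortran co-array features and cannot be shown directly here; as a rough analogue of the same pattern in another SPMD model, a minimal mpi4py sketch:

        # Global-reduction pattern analogous to CafLib's collective operations.
        # Requires mpi4py; run e.g. with: mpiexec -n 4 python reduce_demo.py
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        local_value = float(rank + 1)   # each process contributes one value
        global_sum = comm.allreduce(local_value, op=MPI.SUM)
        global_max = comm.allreduce(local_value, op=MPI.MAX)

        if rank == 0:
            print("sum =", global_sum, "max =", global_max)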

  10. Parallel pipelining

    SciTech Connect

    Joseph, D.D.; Bai, R.; Liao, T.Y.; Huang, A.; Hu, H.H.

    1995-09-01

    In this paper the authors introduce the idea of parallel pipelining for water lubricated transportation of oil (or other viscous material). A parallel system can have major advantages over a single pipe with respect to the cost of maintenance and continuous operation of the system, to the pressure gradients required to restart a stopped system and to the reduction and even elimination of the fouling of pipe walls in continuous operation. The authors show that the action of capillarity in small pipes is more favorable for restart than in large pipes. In a parallel pipeline system, they estimate the number of small pipes needed to deliver the same oil flux as in one larger pipe as N = (R/r)^α, where r and R are the radii of the small and large pipes, respectively, and α = 4 or 19/7 when the lubricating water flow is laminar or turbulent.
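    A quick worked evaluation of the pipe-count estimate, with an illustrative radius ratio:

        # Number of small pipes needed to match the oil flux of one large pipe,
        # N = (R/r)**alpha, with alpha = 4 (laminar lubricating water flow)
        # or 19/7 (turbulent). The radius ratio below is illustrative.
        ratio = 5.0                      # R/r, large-to-small pipe radius ratio
        n_laminar = ratio ** 4           # 625 small pipes
        n_turbulent = ratio ** (19 / 7)  # about 79 small pipes
        print(round(n_laminar), round(n_turbulent))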

  11. Playable Serious Games for Studying and Programming Computational STEM and Informatics Applications of Distributed and Parallel Computer Architectures

    ERIC Educational Resources Information Center

    Amenyo, John-Thones

    2012-01-01

    Carefully engineered playable games can serve as vehicles for students and practitioners to learn and explore the programming of advanced computer architectures to execute applications, such as high performance computing (HPC) and complex, inter-networked, distributed systems. The article presents families of playable games that are grounded in…

  13. High performance parallel architectures

    SciTech Connect

    Anderson, R.E. )

    1989-09-01

    In this paper the author describes current high performance parallel computer architectures. A taxonomy is presented to show computer architecture from the user programmer's point of view. The effects of the taxonomy upon the programming model are described. Some current architectures are described with respect to the taxonomy. Finally, some predictions about future systems are presented. 5 refs., 1 fig.

  14. Development of Parallel GSSHA

    DTIC Science & Technology

    2013-09-01

    Development of Parallel GSSHA. Paul R. Eller, Jing-Ru C. Cheng, Aaron R. Byrd, Charles W. Downer, and Nawa Pradhan. Information Technology Laboratory, US Army Engineer Research and Development Center, report ERDC TR-13-8, September 2013. Approved for public release.

  15. Structure_threader: An improved method for automation and parallelization of programs structure, fastStructure and MavericK on multicore CPU systems.

    PubMed

    Pina-Martins, Francisco; Silva, Diogo N; Fino, Joana; Paulo, Octávio S

    2017-08-04

    Structure_threader is a program to parallelize multiple runs of genetic clustering software that does not make use of multithreading technology (structure, fastStructure and MavericK) on multicore computers. Our approach was benchmarked across multiple systems and displayed great speed improvements relative to the single-threaded implementation, scaling very close to linearly with the number of physical cores used. Structure_threader was compared to previous software written for the same task, ParallelStructure and StrAuto, and proved to be the faster wrapper (up to 25% faster) under all tested scenarios. Furthermore, Structure_threader can perform several automatic and convenient operations, assisting the user in assessing the most biologically likely value of 'K' via implementations such as the "Evanno" or "Thermodynamic Integration" tests, and can automatically draw the "meanQ" plots (static or interactive) for each value of K (or even combined plots). Structure_threader is written in Python 3 and licensed under the GPLv3. It can be downloaded free of charge at https://github.com/StuntsPT/Structure_threader. © 2017 John Wiley & Sons Ltd.
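    The underlying pattern, farming independent single-threaded runs out across the physical cores, can be sketched with the standard library alone; the command line below is a hypothetical placeholder, not the actual structure invocation:

        # Generic pattern for parallelizing many independent single-threaded
        # runs, in the spirit of Structure_threader. The external command is a
        # hypothetical placeholder, not the real structure invocation.
        import subprocess
        from concurrent.futures import ThreadPoolExecutor

        def run_once(args):
            """Supervise one external clustering run (its own OS process)."""
            k, rep = args
            cmd = ["some_clustering_tool", "-K", str(k), "-o", f"out_K{k}_rep{rep}"]
            return subprocess.run(cmd, capture_output=True).returncode

        jobs = [(k, rep) for k in range(1, 6) for rep in range(10)]
        # Four concurrent runs, e.g. one per physical core.
        with ThreadPoolExecutor(max_workers=4) as pool:
            codes = list(pool.map(run_once, jobs))
        print("failed runs:", sum(code != 0 for code in codes))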

  16. Speeding up parallel processing

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.

    1988-01-01

    In 1967 Amdahl expressed doubts about the ultimate utility of multiprocessors. The formulation, now called Amdahl's law, became part of the computing folklore and has inspired much skepticism about the ability of the current generation of massively parallel processors to efficiently deliver all their computing power to programs. The widely publicized recent results of a group at Sandia National Laboratory, which showed speedup on a 1024 node hypercube of over 500 for three fixed size problems and over 1000 for three scalable problems, have convincingly challenged this bit of folklore and have given new impetus to parallel scientific computing.
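    The contrast drawn here is between Amdahl's fixed-size speedup, S = 1/((1-p) + p/N) for parallel fraction p on N processors, and the fixed-time scaled speedup later formalized as Gustafson's law, S = N - (1-p)(N-1). A small numeric comparison with illustrative fractions:

        # Fixed-size (Amdahl) versus fixed-time (Gustafson) speedup on N
        # processors; the parallel fractions are illustrative, not the
        # Sandia measurements.
        def amdahl(p, n):
            """Fixed-size speedup with parallel fraction p."""
            return 1.0 / ((1.0 - p) + p / n)

        def gustafson(p, n):
            """Scaled speedup when the parallel part grows with n."""
            return n - (1.0 - p) * (n - 1)

        N = 1024
        print(round(amdahl(0.99, N)))     # ~91: a 1% serial part caps the speedup
        print(round(gustafson(0.99, N)))  # ~1014: scaling the problem restores it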

  17. Effectiveness of Inclusion of Dry Needling in a Multimodal Therapy Program for Patellofemoral Pain: A Randomized Parallel-Group Trial.

    PubMed

    Espí-López, Gemma V; Serra-Añó, Pilar; Vicent-Ferrando, Juan; Sánchez-Moreno-Giner, Miguel; Arias-Buría, Jose L; Cleland, Joshua; Fernández-de-Las-Peñas, César

    2017-06-01

    Study Design Randomized controlled trial. Background Evidence suggests that multimodal interventions that include exercise therapy may be effective for patellofemoral pain (PFP); however, no study has investigated the effects of trigger point (TrP) dry needling (DN) in people with PFP. Objectives To compare the effects of adding TrP DN to a manual therapy and exercise program on pain, function, and disability in individuals with PFP. Methods Individuals with PFP (n = 60) recruited from a public hospital in Valencia, Spain were randomly allocated to manual therapy and exercises (n = 30) or manual therapy and exercise plus TrP DN (n = 30). Both groups received the same manual therapy and strengthening exercise program for 3 sessions (once a week for 3 weeks), and 1 group also received TrP DN to active TrPs within the vastus medialis and vastus lateralis muscles. The pain subscale of the Knee injury and Osteoarthritis Outcome Score (KOOS; 0-100 scale) was used as the primary outcome. Secondary outcomes included other subscales of the KOOS, the Knee Society Score, the International Knee Documentation Committee Subjective Knee Evaluation Form (IKDC), and the numeric pain-rating scale. Patients were assessed at baseline and at 15-day (posttreatment) and 3-month follow-ups. Analysis was conducted with mixed analyses of covariance, adjusted for baseline scores. Results At 3 months, 58 subjects (97%) completed the follow-up. No significant between-group differences (all, P>.391) were observed for any outcome: KOOS pain subscale mean difference, -2.1 (95% confidence interval [CI]: -4.6, 0.4); IKDC mean difference, 2.3 (95% CI: -0.1, 4.7); knee pain intensity mean difference, 0.3 (95% CI: -0.2, 0.8). Both groups experienced similar moderate-to-large within-group improvements in all outcomes (standardized mean differences of 0.6 to 1.1); however, only the KOOS function in sport and recreation subscale surpassed the prespecified minimum important change. Conclusion The current

  18. Parallel Quality Assessment of Emergency Departments by European Foundation for Quality Management Model and Iranian National Program for Hospital Evaluation

    PubMed Central

    IMANI NASAB, Mohammad Hasan; MOHAGHEGH, Bahram; KHALESI, Nader; JAAFARIPOOYAN, Ebrahim

    2013-01-01

    Background: The European Foundation for Quality Management (EFQM) model is a widely used quality management system (QMS) worldwide, including in Iran. The current study aims to verify the quality assessment results of the Iranian National Program for Hospital Evaluation (INPHE) against those of EFQM. Methods: This cross-sectional study was conducted in 2012 on a sample of emergency departments (EDs) affiliated with Tehran University of Medical Sciences (TUMS), Iran. The standard EFQM questionnaire (V-2010) was used to gather the data, and the results were compared with those of INPHE. MS Excel was used to classify and display the findings. Results: The average assessment scores of the EDs based on INPHE and the EFQM model differed widely (86.4% and 31%, respectively). The variation range among the five EDs' scores was also considerably larger under EFQM (22%) than under INPHE (7%), especially between EDs with and without a prior record of applying QMSs. Conclusion: INPHE's assessment results were not confirmed by the EFQM model. Moreover, the wider variation among EDs' scores under the EFQM model suggests a greater power to differentiate performance than INPHE has. The latter therefore needs improvement, drawing on the strengths of other QMSs such as EFQM. PMID:23967429

  19. File concepts for parallel I/O

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1989-01-01

    The subject of input/output (I/O) was often neglected in the design of parallel computer systems, although for many problems I/O rates will limit the attainable speedup. The I/O problem is addressed by considering the role of files in parallel systems. The notion of parallel files is introduced. Parallel files provide for concurrent access by multiple processes and utilize parallelism in the I/O system to improve performance. Parallel files can also be used conventionally by sequential programs. A set of standard parallel file organizations is proposed, and implementation strategies using multiple storage devices are suggested. Problem areas are also identified and discussed.

  20. Internal combustion engine control for series hybrid electric vehicles by parallel and distributed genetic programming/multiobjective genetic algorithms

    NASA Astrophysics Data System (ADS)

    Gladwin, D.; Stewart, P.; Stewart, J.

    2011-02-01

    This article addresses the problem of maintaining a stable rectified DC output from the three-phase AC generator in a series-hybrid vehicle powertrain. The series-hybrid prime power source generally comprises an internal combustion (IC) engine driving a three-phase permanent magnet generator whose output is rectified to DC. A recent development has been to control the engine/generator combination by an electronically actuated throttle. This system can be represented as a nonlinear system with significant time delay. Previously, voltage control of the generator output has been achieved by model-predictive methods such as the Smith predictor. These methods rely on the incorporation of an accurate system model and time delay into the control algorithm, with a consequent increase in the computational complexity of the real-time controller, and they necessarily rely to some extent on the accuracy of the models. Two complementary performance objectives exist for the control system: first, to maintain the IC engine at its optimal operating point, and second, to supply a stable DC supply to the traction drive inverters. Achieving these goals minimises the transient energy storage requirements at the DC link, with a consequent reduction in both weight and cost. These objectives imply constant-velocity operation of the IC engine under external load disturbances and changes in both operating conditions and vehicle speed set-points. In order to achieve these objectives, and to reduce the complexity of implementation, a controller is designed in this article by the use of Genetic Programming methods in the Simulink modelling environment, with the aim of obtaining a relatively simple controller for the time-delay system that does not rely on the implementation of real-time system models or time-delay approximations in the controller. A methodology is presented to utilise the myriad of existing control blocks in the Simulink libraries to automatically evolve optimal control

  1. Both the Caspase CSP-1 and a Caspase-Independent Pathway Promote Programmed Cell Death in Parallel to the Canonical Pathway for Apoptosis in Caenorhabditis elegans

    PubMed Central

    Denning, Daniel P.; Hatch, Victoria; Horvitz, H. Robert

    2013-01-01

    Caspases are cysteine proteases that can drive apoptosis in metazoans and have critical functions in the elimination of cells during development, the maintenance of tissue homeostasis, and responses to cellular damage. Although a growing body of research suggests that programmed cell death can occur in the absence of caspases, mammalian studies of caspase-independent apoptosis are confounded by the existence of at least seven caspase homologs that can function redundantly to promote cell death. Caspase-independent programmed cell death is also thought to occur in the invertebrate nematode Caenorhabditis elegans. The C. elegans genome contains four caspase genes (ced-3, csp-1, csp-2, and csp-3), of which only ced-3 has been demonstrated to promote apoptosis. Here, we show that CSP-1 is a pro-apoptotic caspase that promotes programmed cell death in a subset of cells fated to die during C. elegans embryogenesis. csp-1 is expressed robustly in late pachytene nuclei of the germline and is required maternally for its role in embryonic programmed cell deaths. Unlike CED-3, CSP-1 is not regulated by the APAF-1 homolog CED-4 or the BCL-2 homolog CED-9, revealing that csp-1 functions independently of the canonical genetic pathway for apoptosis. Previously we demonstrated that embryos lacking all four caspases can eliminate cells through an extrusion mechanism and that these cells are apoptotic. Extruded cells differ from cells that normally undergo programmed cell death not only by being extruded but also by not being engulfed by neighboring cells. In this study, we identify in csp-3; csp-1; csp-2 ced-3 quadruple mutants apoptotic cell corpses that fully resemble wild-type cell corpses: these caspase-deficient cell corpses are morphologically apoptotic, are not extruded, and are internalized by engulfing cells. We conclude that both caspase-dependent and caspase-independent pathways promote apoptotic programmed cell death and the phagocytosis of cell corpses in parallel to

  2. The NAS Parallel Benchmarks

    SciTech Connect

    Bailey, David H.

    2009-11-15

    The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym 'NAS' originally stood for the Numerical Aerodynamic Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains 'NAS.' The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, Leo Dagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga. The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage

  3. EDDY RESOLVING NUTRIENT ECODYNAMICS IN THE GLOBAL PARALLEL OCEAN PROGRAM AND CONNECTIONS WITH TRACE GASES IN THE SULFUR, HALOGEN AND NMHC CYCLES

    SciTech Connect

    S. CHU; S. ELLIOTT

    2000-08-01

    Ecodynamics and the sea-air transfer of climate relevant trace gases are intimately coupled in the oceanic mixed layer. Ventilation of species such as dimethyl sulfide and methyl bromide constitutes a key linkage within the earth system. We are creating a research tool for the study of marine trace gas distributions by implementing coupled ecology-gas chemistry in the high resolution Parallel Ocean Program (POP). The fundamental circulation model is eddy resolving, with cell sizes averaging 0.15 degree (lat/long). Here we describe ecochemistry integration. Density dependent mortality and iron geochemistry have enhanced agreement with chlorophyll measurements. Indications are that dimethyl sulfide production rates must be adjusted for latitude dependence to match recent compilations. This may reflect the need for phytoplankton to conserve nitrogen by favoring sulfurous osmolytes. Global simulations are also available for carbonyl sulfide, the methyl halides and for nonmethane hydrocarbons. We discuss future applications including interaction with atmospheric chemistry models, high resolution biogeochemical snapshots and the study of open ocean fertilization.

  4. Parallel Logic Programming and ZMOB

    DTIC Science & Technology

    1988-04-26

  5. Parallelizing Data-Centric Programs

    DTIC Science & Technology

    2013-09-25

    Figure 5 (recovered caption): A visual path from Il Vittoriano to the Colosseum in the Forum dataset, a sequence of consecutive images drawn from a collection of nearly 75K images of Rome connecting the two landmarks.

  6. Progress in parallelizing XOOPIC

    NASA Astrophysics Data System (ADS)

    Mardahl, Peter; Verboncoeur, J. P.

    1997-11-01

    XOOPIC (Object-Oriented Particle-In-Cell code for X11-based Unix workstations) is presently a serial 2-D 3v particle-in-cell plasma simulation (J.P. Verboncoeur, A.B. Langdon, and N.T. Gladd, "An object-oriented electromagnetic PIC code," Computer Physics Communications 87 (1995) 199-211). The present effort focuses on using parallel and distributed processing to optimize the simulation for large problems. The benefits include increased capacity for memory-intensive problems and improved performance for processor-intensive problems. The MPI library is used so that the parallel version can be easily ported to massively parallel, SMP, and distributed computers. The philosophy employed here is to spatially decompose the system into computational regions separated by 'virtual boundaries': objects which contain the local data and the algorithms needed to perform the local field solve and the particle communication between regions. This design confines the changes that parallelization requires to a small part of the program. Specific implementation details, such as the hiding of communication latency behind local computation, will also be discussed.
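    The 'virtual boundary' pattern, in which neighbouring spatial regions exchange their edge data, can be sketched in one dimension with mpi4py; the sizes and periodic neighbour layout are illustrative, not XOOPIC's actual decomposition:

        # 1D domain decomposition with ghost-cell exchange between neighbours,
        # the pattern hidden behind XOOPIC-style "virtual boundaries".
        # Run with e.g.: mpiexec -n 4 python ghost_demo.py
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        local = np.full(10, float(rank))   # interior cells of this region
        left = (rank - 1) % size           # periodic neighbours for simplicity
        right = (rank + 1) % size

        # Send the right edge to the right neighbour while receiving the left
        # neighbour's right edge, and vice versa; sendrecv avoids deadlock.
        ghost_left = comm.sendrecv(local[-1], dest=right, source=left)
        ghost_right = comm.sendrecv(local[0], dest=left, source=right)
        print(f"rank {rank}: ghosts ({ghost_left}, {ghost_right})")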

  7. EFFICIENT SCHEDULING OF PARALLEL JOBS ON MASSIVELY PARALLEL SYSTEMS

    SciTech Connect

    F. PETRINI; W. FENG

    1999-09-01

    We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of flow-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads.

  8. New NAS Parallel Benchmarks Results

    NASA Technical Reports Server (NTRS)

    Yarrow, Maurice; Saphir, William; VanderWijngaart, Rob; Woo, Alex; Kutler, Paul (Technical Monitor)

    1997-01-01

    NPB2 (NAS (NASA Advanced Supercomputing) Parallel Benchmarks 2) is an implementation, based on Fortran and the MPI (message passing interface) message passing standard, of the original NAS Parallel Benchmark specifications. NPB2 programs are run with little or no tuning, in contrast to NPB vendor implementations, which are highly optimized for specific architectures. NPB2 results complement, rather than replace, NPB results. Because they have not been optimized by vendors, NPB2 implementations approximate the performance a typical user can expect for a portable parallel program on distributed memory parallel computers. Together these results provide an insightful comparison of the real-world performance of high-performance computers. New NPB2 features: New implementation (CG), new workstation class problem sizes, new serial sample versions, more performance statistics.

  9. Parallel machine architecture and compiler design facilities

    NASA Technical Reports Server (NTRS)

    Kuck, David J.; Yew, Pen-Chung; Padua, David; Sameh, Ahmed; Veidenbaum, Alex

    1990-01-01

    The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of the Delta project (whose objective is to provide a facility for rapid prototyping of parallelizing compilers that can target different machine architectures) is summarized. Included are surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role.

  10. Address tracing for parallel machines

    NASA Technical Reports Server (NTRS)

    Stunkel, Craig B.; Janssens, Bob; Fuchs, W. Kent

    1991-01-01

    Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately.

  11. Debugging in a parallel environment

    SciTech Connect

    Wasserman, H.J.; Griffin, J.H.

    1985-01-01

    This paper describes the preliminary results of a project investigating approaches to dynamic debugging in parallel processing systems. Debugging programs in a multiprocessing environment is particularly difficult because of potential errors in synchronization of tasks, data dependencies, sharing of data among tasks, and irreproducibility of specific machine instruction sequences from one job to the next. The basic methodology involved in predicate-based debuggers is given as well as other desirable features of dynamic parallel debugging. 13 refs.

  12. Multigrid on massively parallel architectures

    SciTech Connect

    Falgout, R D; Jones, J E

    1999-09-17

    The scalable implementation of multigrid methods for machines with several thousands of processors is investigated. Parallel performance models are presented for three different structured-grid multigrid algorithms, and a description is given of how these models can be used to guide implementation. Potential pitfalls are illustrated when moving from moderate-sized parallelism to large-scale parallelism, and results are given from existing multigrid codes to support the discussion. Finally, the use of mixed programming models is investigated for multigrid codes on clusters of SMPs.

  13. Advantages of a programmed temperature vaporizer inlet and parallel factor analysis in the determination of triazines in the presence of non-intentionally added substances by gas chromatography.

    PubMed

    Rubio, L; Sarabia, L A; Herrero, A; Ortiz, M C

    2012-05-01

    Non-intentionally added substances (NIASs) are usually detected by acquiring mass spectra in full scan mode and then identifying the compounds corresponding to the unexpected peaks. High-resolution mass spectrometry detectors are frequently used, but this does not solve the problem that an NIAS can contribute to the abundance at m/z ratios that correspond to the fragmentation of other molecules. This problem leads to false negatives when identifying compounds, even in target analysis when the maximum permitted tolerances for relative ion abundances (SANCO/10684/2009) are taken into account. In this work, the introduction of different volumes of a test sample onto a GC/MS system that has a programmed temperature vaporizer inlet and is operating in full scan mode provides a data tensor. The proposed approach consists of considering the structure of the matrix of abundances of K m/z ratios acquired at J elution times for each chromatographic peak. Upon concatenating I of such matrices, a three-way tensor X is obtained, which is then decomposed using parallel factor analysis into as many factors as there are substances coeluting, thus providing the mass spectrum and the chromatographic profile for each of them. If the amount of an analyte changes significantly during the I injections, then it can be unequivocally identified. This procedure thus identifies coeluting NIASs, provides information about their mass spectra, and guarantees the identification and quantification of target compounds. In this work, it is used to determine five triazines in the presence of NIASs which match some of the m/z ratios of the triazines and coelute with them. Decision limits (CCα) of between 7.5 and 25.0 μg L⁻¹ were obtained.
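    The decomposition step can be sketched with the tensorly library on synthetic trilinear data; the paper's actual tensors, rank selection, and preprocessing are not reproduced here:

        # PARAFAC decomposition of an I x J x K tensor (injections x elution
        # times x m/z channels) into per-substance profiles, sketched with
        # tensorly on synthetic data.
        import numpy as np
        import tensorly as tl
        from tensorly.decomposition import parafac

        rng = np.random.default_rng(0)
        I, J, K, n_substances = 5, 40, 30, 3   # illustrative sizes

        # Build a synthetic trilinear tensor: sum over substances of a*b*c.
        A = rng.random((I, n_substances))      # amounts per injection
        B = rng.random((J, n_substances))      # chromatographic profiles
        C = rng.random((K, n_substances))      # mass spectra
        X = tl.cp_to_tensor((np.ones(n_substances), [A, B, C]))

        weights, factors = parafac(X, rank=n_substances)
        amounts, profiles, spectra = factors   # one column per substance
        print(spectra.shape)                   # (30, 3): a spectrum estimate each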

  14. An experimental APL compiler for a distributed memory parallel machine

    SciTech Connect

    Ching, W.M.; Katz, A.

    1994-12-31

    The authors developed an experimental APL compiler for the IBM SP1 distributed memory parallel machine. It accepts classical APL programs, without additional directives, and generates parallelized C code for execution on the SP1 machine. The compiler exploits data parallelism in APL programs based on parallel high level primitives. Program variables are either replicated or partitioned. They also present performance data for five moderate size programs running on the SP1.

  15. Parallel pivoting combined with parallel reduction

    NASA Technical Reports Server (NTRS)

    Alaghband, Gita

    1987-01-01

    Parallel algorithms for triangularization of large, sparse, and unsymmetric matrices are presented. The method combines the parallel reduction with a new parallel pivoting technique, control over generations of fill-ins and a check for numerical stability, all done in parallel with the work being distributed over the active processes. The parallel technique uses the compatibility relation between pivots to identify parallel pivot candidates and uses the Markowitz number of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds.

  16. Task parallelism and high-performance languages

    SciTech Connect

    Foster, I.

    1996-03-01

    The definition of High Performance Fortran (HPF) is a significant event in the maturation of parallel computing: it represents the first parallel language that has gained widespread support from vendors and users. The subject of this paper is the incorporation of support for task parallelism. The term task parallelism refers to the explicit creation of multiple threads of control, or tasks, which synchronize and communicate under programmer control. Task and data parallelism are complementary rather than competing programming models. While task parallelism is more general and can be used to implement algorithms that are not amenable to data-parallel solutions, many problems can benefit from a mixed approach, with for example a task-parallel coordination layer integrating multiple data-parallel computations. Other problems admit to both data- and task-parallel solutions, with the better solution depending on machine characteristics, compiler performance, or personal taste. For these reasons, we believe that a general-purpose high-performance language should integrate both task- and data-parallel constructs. The challenge is to do so in a way that provides the expressivity needed for applications, while preserving the flexibility and portability of a high-level language. In this paper, we examine and illustrate the considerations that motivate the use of task parallelism. We also describe one particular approach to task parallelism in Fortran, namely the Fortran M extensions. Finally, we contrast Fortran M with other proposed approaches and discuss the implications of this work for task parallelism and high-performance languages.

  17. Gang scheduling a parallel machine

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III.

    1991-03-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processors. User programs and their gangs of processors are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quanta are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory. 2 refs., 1 fig.

  18. Gang scheduling a parallel machine

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III.

    1991-12-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processes. User programs and their gangs of processes are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quanta are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory.

  19. Special parallel processing workshop

    SciTech Connect

    1994-12-01

    This report contains viewgraphs from the Special Parallel Processing Workshop. These viewgraphs deal with topics such as parallel processing performance, message passing, queue structure, and other basic concepts of parallel processing.

  20. Implementation and performance of parallelized elegant.

    SciTech Connect

    Wang, Y.; Borland, M.; Accelerator Systems Division

    2008-01-01

    The program elegant is widely used for design and modeling of linacs for free-electron lasers and energy recovery linacs, as well as storage rings and other applications. As part of a multi-year effort, we have parallelized many aspects of the code, including single-particle dynamics, wakefields, and coherent synchrotron radiation. We report on the approach used for gradual parallelization, which proved very beneficial in getting parallel features into the hands of users quickly. We also report details of parallelization of collective effects. Finally, we discuss performance of the parallelized code in various applications.

  1. Data-parallel algorithms for image computing

    NASA Astrophysics Data System (ADS)

    Carlotto, Mark J.

    1990-11-01

    Data-parallel algorithms for image computing on the Connection Machine are described. After a brief review of some basic programming concepts in *Lisp, a parallel extension of Common Lisp, data-parallel programming paradigms based on a local (diffusion-like) model of computation, the scan model of computation, a general interprocessor communications model, and a region-based model are introduced. Algorithms for connected component labeling, distance transformation, Voronoi diagrams, finding minimum cost paths, local means, shape-from-shading, hidden surface calculations, affine transformation, oblique parallel projection, and spatial operations over regions are presented. A new algorithm for interpolating irregularly spaced data via Voronoi diagrams is also described.
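    Of the models listed, the scan model is the most compact to illustrate: many image operations reduce to prefix sums. A summed-area table, for instance, is two scans, one per axis; in this sketch NumPy's cumsum stands in for the data-parallel scan primitive:

        # Scan-model idiom: a summed-area table is two prefix-sum (scan)
        # passes, one per axis. NumPy's cumsum stands in for the Connection
        # Machine's data-parallel scan primitives.
        import numpy as np

        image = np.arange(12, dtype=float).reshape(3, 4)
        sat = image.cumsum(axis=0).cumsum(axis=1)   # scan rows, then columns

        # Any rectangle sum now costs four lookups; here, the whole image:
        print(sat[-1, -1], image.sum())             # both 66.0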

  2. Parallel Eclipse Project Checkout

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas M.; Joswig, Joseph C.; Shams, Khawaja S.; Powell, Mark W.; Bachmann, Andrew G.

    2011-01-01

    Parallel Eclipse Project Checkout (PEPC) is a program written to leverage parallelism and to automate the checkout process of plug-ins created in Eclipse RCP (Rich Client Platform). Eclipse plug-ins can be aggregated in a feature project. This innovation digests a feature description (xml file) and automatically checks out all of the plug-ins listed in the feature. This resolves the issue of manually checking out each plug-in required to work on the project. To minimize the amount of time necessary to checkout the plug-ins, this program makes the plug-in checkouts parallel. After parsing the feature, a checkout request is issued for each plug-in in the feature. These requests are handled by a thread pool with a configurable number of threads. By checking out the plug-ins in parallel, the checkout process is streamlined before getting started on the project. For instance, projects that took 30 minutes to checkout now take less than 5 minutes. The effect is especially clear on a Mac, which has a network monitor displaying the bandwidth use. When running the client from a developer's home, the checkout process now saturates the bandwidth in order to get all the plug-ins checked out as fast as possible. For comparison, a checkout process that ranged from 8-200 Kbps from a developer's home is now able to saturate a pipe of 1.3 Mbps, resulting in significantly faster checkouts. Eclipse IDE (integrated development environment) tries to build a project as soon as it is downloaded. As part of another optimization, this innovation programmatically tells Eclipse to stop building while checkouts are happening, which dramatically reduces lock contention and enables plug-ins to continue downloading until all of them finish. Furthermore, the software re-enables automatic building, and forces Eclipse to do a clean build once it finishes checking out all of the plug-ins. This software is fully generic and does not contain any NASA-specific code. It can be applied to any
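    The core pattern, parsing a feature manifest and checking out each listed plug-in through a fixed-size thread pool, might look roughly as follows; the manifest layout and checkout command are hypothetical placeholders rather than PEPC's actual code:

        # Thread-pooled checkout of many repositories, in the spirit of PEPC.
        # The plug-in list layout and checkout command are hypothetical.
        import subprocess
        from concurrent.futures import ThreadPoolExecutor
        from xml.etree import ElementTree

        def plugin_ids(feature_xml):
            """Collect plug-in ids from a feature manifest (assumed layout)."""
            root = ElementTree.parse(feature_xml).getroot()
            return [p.get("id") for p in root.iter("plugin")]

        def checkout(plugin_id):
            cmd = ["svn", "checkout", f"https://example.org/repo/{plugin_id}"]
            return subprocess.run(cmd).returncode

        ids = plugin_ids("feature.xml")
        with ThreadPoolExecutor(max_workers=8) as pool:   # configurable size
            results = list(pool.map(checkout, ids))
        print("failures:", sum(r != 0 for r in results))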

  3. Algorithms for the Construction of Parallel Tests by Zero-One Programming. Project Psychometric Aspects of Item Banking No. 7. Research Report 86-7.

    ERIC Educational Resources Information Center

    Boekkooi-Timminga, Ellen

    Nine methods for automated test construction are described. All are based on the concepts of information from item response theory. Two general kinds of methods for the construction of parallel tests are presented: (1) sequential test design; and (2) simultaneous test design. Sequential design implies that the tests are constructed one after the…

  4. Parallel rendering techniques for massively parallel visualization

    SciTech Connect

    Hansen, C.; Krogh, M.; Painter, J.

    1995-07-01

    As the resolution of simulation models increases, scientific visualization algorithms which take advantage of the large memory and parallelism of Massively Parallel Processors (MPPs) are becoming increasingly important. For large applications, rendering on the MPP tends to be preferable to rendering on a graphics workstation due to the MPP's abundant resources: memory, disk, and numerous processors. The challenge becomes developing algorithms that can exploit these resources while minimizing overhead, typically communication costs. This paper will describe recent efforts in parallel rendering for polygonal primitives as well as parallel volumetric techniques. This paper presents rendering algorithms, developed for massively parallel processors (MPPs), for polygons, spheres, and volumetric data. The polygon algorithm uses a data-parallel approach, whereas the sphere and volume renderers use a MIMD approach. Implementations for these algorithms are presented for the Thinking Machines Corporation CM-5 MPP.

  5. FWT2D: A massively parallel program for frequency-domain full-waveform tomography of wide-aperture seismic data—Part 1: Algorithm

    NASA Astrophysics Data System (ADS)

    Sourbier, Florent; Operto, Stéphane; Virieux, Jean; Amestoy, Patrick; L'Excellent, Jean-Yves

    2009-03-01

    This is the first paper in a two-part series that describes a massively parallel code that performs 2D frequency-domain full-waveform inversion of wide-aperture seismic data for imaging complex structures. Full-waveform inversion methods, namely quantitative seismic imaging methods based on the resolution of the full wave equation, are computationally expensive. Therefore, designing efficient algorithms which take advantage of parallel computing facilities is critical for the appraisal of these approaches when applied to representative case studies and for further improvements. Full-waveform modelling requires the resolution of a large sparse system of linear equations which is performed with the massively parallel direct solver MUMPS for efficient multiple-shot simulations. Efficiency of the multiple-shot solution phase (forward/backward substitutions) is improved by using the BLAS3 library. The inverse problem relies on a classic local optimization approach implemented with a gradient method. The direct solver returns the multiple-shot wavefield solutions distributed over the processors according to a domain decomposition driven by the distribution of the LU factors. The domain decomposition of the wavefield solutions is used to compute in parallel the gradient of the objective function and the diagonal Hessian, this latter providing a suitable scaling of the gradient. The algorithm allows one to test different strategies for multiscale frequency inversion ranging from successive mono-frequency inversion to simultaneous multifrequency inversion. These different inversion strategies will be illustrated in the following companion paper. The parallel efficiency and the scalability of the code will also be quantified.
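    The step described above, scaling each gradient component by the inverse of the corresponding diagonal Hessian entry, is a standard preconditioned update; the following is a generic sketch, not FWT2D's code:

        # One preconditioned gradient step: scale the gradient elementwise by
        # the inverse diagonal Hessian, as described for the FWT2D inverse
        # problem. Generic sketch; the objective and gradients live in the
        # actual solver.
        import numpy as np

        def update(model, grad, diag_hessian, step, damping=1e-3):
            """Return the model after one scaled gradient step."""
            return model - step * grad / (diag_hessian + damping)

        m = np.ones(100)                 # illustrative model vector
        g = 0.1 * np.ones(100)           # gradient of the objective at m
        h = np.linspace(0.5, 2.0, 100)   # diagonal Hessian estimate
        m_new = update(m, g, h, step=1.0)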

  6. West Virginia US Department of Energy experimental program to stimulate competitive research. Section 2: Human resource development; Section 3: Carbon-based structural materials research cluster; Section 3: Data parallel algorithms for scientific computing

    SciTech Connect

    Not Available

    1994-02-02

    This report consists of three separate but related reports. They are (1) Human Resource Development, (2) Carbon-based Structural Materials Research Cluster, and (3) Data Parallel Algorithms for Scientific Computing. To meet the objectives of the Human Resource Development plan, the plan includes K--12 enrichment activities, undergraduate research opportunities for students at the state's two Historically Black Colleges and Universities, graduate research through cluster assistantships and through a traineeship program targeted specifically to minorities, women and the disabled, and faculty development through participation in research clusters. One research cluster is the chemistry and physics of carbon-based materials. The objective of this cluster is to develop a self-sustaining group of researchers in carbon-based materials research within the institutions of higher education in the state of West Virginia. The projects will involve analysis of cokes, graphites and other carbons in order to understand the properties that provide desirable structural characteristics, including resistance to oxidation, levels of anisotropy and structural characteristics of the carbons themselves. In the proposed cluster on parallel algorithms, the research projects by four WVU faculty and three state liberal arts college faculty are: (1) modeling of self-organized critical systems by cellular automata; (2) multiprefix algorithms and fat-free embeddings; (3) offline and online partitioning of data computation; and (4) manipulating and rendering three dimensional objects. This cluster furthers the state Experimental Program to Stimulate Competitive Research plan by building on existing strengths at WVU in parallel algorithms.

  7. Nexus: An interoperability layer for parallel and distributed computer systems

    SciTech Connect

    Foster, I.; Kesselman, C.; Olson, R.; Tuecke, S.

    1994-05-01

    Nexus is a set of services that can be used to implement various task-parallel languages, data-parallel languages, and message-passing libraries. Nexus is designed to permit the efficient portable implementation of individual parallel programming systems and the interoperability of programs developed with different tools. Nexus supports lightweight threading and active message technology, allowing integration of message passing and threads.

  8. Massively Parallel QCD

    SciTech Connect

    Soltz, R; Vranas, P; Blumrich, M; Chen, D; Gara, A; Giampap, M; Heidelberger, P; Salapura, V; Sexton, J; Bhanot, G

    2007-04-11

    The theory of the strong nuclear force, Quantum Chromodynamics (QCD), can be numerically simulated from first principles on massively-parallel supercomputers using the method of Lattice Gauge Theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures that it suggests. We demonstrate these methods on the BlueGene massively-parallel supercomputer and argue that LQCD and the BlueGene architecture are a natural match. This can be traced to the simple fact that LQCD is a regular lattice discretization of space into lattice sites while the BlueGene supercomputer is a discretization of space into compute nodes, and that both are constrained by requirements of locality. This simple relation is both technologically important and theoretically intriguing. The main result of this paper is the speedup of LQCD using up to 131,072 CPUs on the largest BlueGene/L supercomputer. The speedup is perfect with sustained performance of about 20% of peak. This corresponds to a maximum of 70.5 sustained TFlop/s. At these speeds LQCD and BlueGene are poised to produce the next generation of strong interaction physics theoretical results.

  9. RCCPAC: A parallel relativistic coupled-cluster program for closed-shell and one-valence atoms and ions in FORTRAN

    NASA Astrophysics Data System (ADS)

    Mani, B. K.; Chattopadhyay, S.; Angom, D.

    2017-04-01

    We report the development of a parallel FORTRAN code, RCCPAC, to solve the relativistic coupled-cluster equations for closed-shell and one-valence atoms and ions. The parallelization is implemented through the use of message passing interface, which is suitable for distributed memory computers. The coupled-cluster equations are defined in terms of the reduced matrix elements, and solved iteratively using Jacobi method. The ground and excited states of coupled-cluster wave functions obtained from the code could be used to compute different properties of closed-shell and one-valence atom or ion. As an example we compute the ground state correlation energy, attachment energies, E1 reduced matrix elements and hyperfine structure constants.
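    The Jacobi iteration mentioned above, shown here on a plain linear system Ax = b for illustration (the coupled-cluster equations are nonlinear in the amplitudes, but the update pattern is the same):

        # Jacobi iteration for Ax = b: x_new = D^{-1} (b - (A - D) x).
        # Generic sketch of the scheme RCCPAC applies to the coupled-cluster
        # equations; converges here because A is diagonally dominant.
        import numpy as np

        A = np.array([[4.0, 1.0, 0.0],
                      [1.0, 5.0, 2.0],
                      [0.0, 2.0, 6.0]])
        b = np.array([1.0, 2.0, 3.0])

        x = np.zeros_like(b)
        D = np.diag(A)
        R = A - np.diag(D)                      # off-diagonal part
        for _ in range(50):
            x = (b - R @ x) / D
        print(np.allclose(A @ x, b, atol=1e-8))  # True: converged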

  10. Toward Parallel Document Clustering

    SciTech Connect

    Mogill, Jace A.; Haglin, David J.

    2011-09-01

    A key challenge to automated clustering of documents in large text corpora is the high cost of comparing documents in a multimillion-dimensional document space. The Anchors Hierarchy is a fast data structure and algorithm for localizing data based on a triangle-inequality-obeying distance metric; the algorithm strives to minimize the number of distance calculations needed to cluster the documents into “anchors” around reference documents called “pivots”. We extend the original algorithm to increase the amount of available parallelism and consider two implementations: a complex data structure which affords efficient searching, and a simple data structure which requires repeated sorting. The sorting implementation is integrated with a text corpora “Bag of Words” program, and initial performance results of an end-to-end document processing workflow are reported.

  11. Parallel BLAST on split databases.

    PubMed

    Mathog, David R

    2003-09-22

    BLAST programs often run on large SMP machines where multiple threads can work simultaneously and there is enough memory to cache the databases between program runs. A group of programs is described which allows comparable performance to be achieved with a Beowulf configuration in which no node has enough memory to cache a database but the cluster as an aggregate does. To achieve this result, databases are split into equal sized pieces and stored locally on each node. Each query is run on all nodes in parallel and the resultant BLAST output files from all nodes merged to yield the final output. Source code is available from ftp://saf.bio.caltech.edu/
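    The scheme reduces to running the query against every database piece concurrently and then merging the per-piece reports. A condensed sketch using the BLAST+ command line; note that a faithful merge must also correct the reported statistics for the aggregate database size, which this naive concatenation does not:

        # Query several pre-split database pieces in parallel, then
        # concatenate the reports. Sketch only: statistics correction and
        # per-node placement (as in the paper's tools) are omitted.
        import subprocess
        from concurrent.futures import ThreadPoolExecutor

        PIECES = [f"db_piece_{i}" for i in range(4)]   # equal-sized pieces

        def search(piece):
            """Run the query against one piece; return the output path."""
            out = f"{piece}.out"
            subprocess.run(["blastp", "-query", "query.fa", "-db", piece,
                            "-out", out], check=True)
            return out

        with ThreadPoolExecutor(max_workers=len(PIECES)) as pool:
            outputs = list(pool.map(search, PIECES))

        with open("merged.out", "w") as merged:        # naive concatenation
            for path in outputs:
                with open(path) as piece_out:
                    merged.write(piece_out.read())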

  12. The language parallel Pascal and other aspects of the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.; Bruner, J. D.

    1982-01-01

    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

  13. Parallel Adaptive Mesh Refinement Library

    NASA Technical Reports Server (NTRS)

    Mac-Neice, Peter; Olson, Kevin

    2005-01-01

    Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.

  14. High-level multireference methods in the quantum-chemistry program system COLUMBUS: analytic MR-CISD and MR-AQCC gradients and MR-AQCC-LRT for excited states, GUGA spin-orbit CI, and parallel CI density.

    SciTech Connect

    Lischka, H.; Shepard, R.; Pitzer, R. M.; Shavitt, I.; Dallos, M.; Muller, T.; Szalay, P. G.; Seth, M.; Kedziora, G. S.; Yabushita, S.; Zhang, Z.; Chemistry; Univ. of Vienna; Ohio State Univ.; Univ. of Illinois; Eotvos Lorand Univ.; Univ. of Calgary; Northwestern Univ.; Keio Univ.; PNNL

    2001-01-01

    Development of several new computational approaches within the framework of multi-reference ab initio molecular electronic structure methodology and their implementation in the COLUMBUS program system are reported. These new features are: calculation of the analytical MR-CI gradient for excited states based on state-averaged MCSCF orbitals, the extension of the MR-ACPF/AQCC methods to excited states in the framework of linear-response theory, spin-orbit CI for molecules containing heavy atoms and the development of a massively-parallel code for the computation of the one- and two-particle density matrix elements. Illustrative examples are given for each of these cases.

  15. An object-oriented approach to nested data parallelism

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.; Chatterjee, Siddhartha

    1994-01-01

    This paper describes an implementation technique for integrating nested data parallelism into an object-oriented language. Data-parallel programming employs sets of data called 'collections' and expresses parallelism as operations performed over the elements of a collection. When the elements of a collection are also collections, then there is the possibility for 'nested data parallelism.' Few current programming languages support nested data parallelism, however. In an object-oriented framework, a collection is a single object. Its type defines the parallel operations that may be applied to it. Our goal is to design and build an object-oriented data-parallel programming environment supporting nested data parallelism. Our initial approach is built upon three fundamental additions to C++. We add new parallel base types by implementing them as classes, and add a new parallel collection type called a 'vector' that is implemented as a template. Only one new language feature is introduced: the 'foreach' construct, which is the basis for exploiting elementwise parallelism over collections. The strength of the method lies in the compilation strategy, which translates nested data-parallel C++ into ordinary C++. Extracting the potential parallelism in nested 'foreach' constructs is called 'flattening' nested parallelism. We show how to flatten 'foreach' constructs using a simple program transformation. Our prototype system produces vector code which has been successfully run on workstations, a CM-2, and a CM-5.
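
    The flattening transformation is easiest to see on a segmented representation. The hypothetical Python sketch below (the paper's setting is C++ templates; the idea, not the syntax, is what carries over) stores a nested collection as one flat array plus per-segment lengths, so an elementwise operation inside a nested 'foreach' becomes a single flat, trivially parallel pass instead of a loop over ragged sub-collections.

        def flatten(nested):
            # Segmented representation: flat values plus one length per segment.
            flat = [x for seg in nested for x in seg]
            lens = [len(seg) for seg in nested]
            return flat, lens

        def unflatten(flat, lens):
            out, i = [], 0
            for n in lens:
                out.append(flat[i:i + n])
                i += n
            return out

        nested = [[1, 2], [3], [4, 5, 6]]
        flat, lens = flatten(nested)
        doubled = [2 * x for x in flat]      # one flat, trivially parallel pass
        print(unflatten(doubled, lens))      # [[2, 4], [6], [8, 10, 12]]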

  16. Parallel flow diffusion battery

    DOEpatents

    Yeh, Hsu-Chi; Cheng, Yung-Sung

    1984-08-07

    A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.

  17. Parallel flow diffusion battery

    DOEpatents

    Yeh, H.C.; Cheng, Y.S.

    1984-01-01

    A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.

  18. PALM: a Parallel Dynamic Coupler

    NASA Astrophysics Data System (ADS)

    Thevenin, A.; Morel, T.

    2008-12-01

    In order to efficiently represent complex systems, numerical modeling has to rely on many physical models at a time: an ocean model coupled with an atmospheric model is at the basis of climate modeling. The continuity of the solution is granted only if these models can constantly exchange information. PALM is a coupler that allows the concurrent execution and intercommunication of programs that were not especially designed for coupling. With PALM, the dynamic coupling approach is introduced: a coupled component can be launched, and can release computing resources upon termination, at any moment during the simulation. To exploit the capabilities of the computer as fully as possible, the PALM coupler handles two levels of parallelism. The first level concerns the components themselves. While managing the resources, PALM allocates to each coupled component the number of processes it requires. These models can be parallel programs based on domain decomposition with MPI or applications multithreaded with OpenMP. The second level of parallelism is task parallelism: one can define a coupling algorithm that allows two or more programs to be executed in parallel. PALM applications are implemented via a Graphical User Interface called PrePALM. In this GUI, the programmer first defines the coupling algorithm and then describes the actual communications between the models. PALM offers very high flexibility for testing different coupling techniques and for reaching the best load balance on a high performance computer. The transformation of computationally independent code is almost straightforward. The other qualities of PALM are its easy set-up, its flexibility, its performance, the simple updates and evolutions of the coupled application, and the many side services and functions that it offers.

  19. Parallel processing ITS

    SciTech Connect

    Fan, W.C.; Halbleib, J.A. Sr.

    1996-09-01

    This report provides a users' guide for parallel processing ITS on a UNIX workstation network, a shared-memory multiprocessor or a massively-parallel processor. The parallelized version of ITS is based on a master/slave model with message passing. Parallel issues such as random number generation, load balancing, and communication software are briefly discussed. Timing results for example problems are presented for demonstration purposes.
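
    The master/slave organization with message passing generalizes well beyond ITS. As a hypothetical Python illustration (standard library only; this is not the ITS code), the master below hands out batches of histories through a queue, each slave tallies its batches with an independent random-number stream, and the master sums the returned counts; load balancing falls out automatically because faster workers simply pull more batches.

        from multiprocessing import Process, Queue
        import random

        def slave(wid, tasks, results):
            rng = random.Random(12345 + wid)      # independent stream per slave
            for batch in iter(tasks.get, None):   # work until the None sentinel
                hits = sum(rng.random() < 0.3 for _ in range(batch))
                results.put(hits)

        def master(n_batches=40, batch=1000, n_slaves=4):
            tasks, results = Queue(), Queue()
            procs = [Process(target=slave, args=(w, tasks, results))
                     for w in range(n_slaves)]
            for p in procs:
                p.start()
            for _ in range(n_batches):
                tasks.put(batch)                  # fast slaves pull more batches
            for _ in procs:
                tasks.put(None)                   # one sentinel per slave
            total = sum(results.get() for _ in range(n_batches))
            for p in procs:
                p.join()
            return total / (n_batches * batch)    # fraction of "hit" histories

        if __name__ == "__main__":
            print(master())                       # approximately 0.3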

  20. Parallel distributed computing using Python

    NASA Astrophysics Data System (ADS)

    Dalcin, Lisandro D.; Paz, Rodrigo R.; Kler, Pablo A.; Cosimo, Alejandro

    2011-09-01

    This work presents two software components aimed at relieving the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python. MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm. PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state-of-the-art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering. MPI for Python and PETSc for Python are fully integrated with PETSc-FEM, an MPI- and PETSc-based parallel, multiphysics, finite-element code developed at the CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.
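
    As a minimal taste of the facilities MPI for Python provides (this toy program is ours, not from the paper), each rank sums its own slice of a range and the root collects the grand total with a message-passing reduction; it would be launched with something like mpiexec -n 4 python sum.py:

        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        # Each rank sums its own strided slice of 0..999.
        local = sum(range(rank, 1000, size))

        total = comm.reduce(local, op=MPI.SUM, root=0)  # message-passing reduction
        if rank == 0:
            print("total =", total)                     # 499500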

  1. Stage-by-Stage and Parallel Flow Path Compressor Modeling for a Variable Cycle Engine, NASA Advanced Air Vehicles Program - Commercial Supersonic Technology Project - AeroServoElasticity

    NASA Technical Reports Server (NTRS)

    Kopasakis, George; Connolly, Joseph W.; Cheng, Larry

    2015-01-01

    This paper covers the development of stage-by-stage and parallel flow path compressor modeling approaches for a Variable Cycle Engine. The stage-by-stage compressor modeling approach is an extension of a technique for lumped volume dynamics and performance characteristic modeling. It was developed to improve the accuracy of axial compressor dynamics over lumped volume dynamics modeling. The stage-by-stage compressor model presented here is formulated into a parallel flow path model that includes both axial and rotational dynamics. This is done to enable the study of compressor and propulsion system dynamic performance under flow distortion conditions. The approaches utilized here are generic and should be applicable to the modeling of any axial flow compressor design for accurate time domain simulations. The objective of this work is as follows. Given the parameters describing the conditions of atmospheric disturbances, and utilizing the derived formulations, directly compute the transfer function poles and zeros describing these disturbances for acoustic velocity, temperature, pressure, and density. Time domain simulations of representative atmospheric turbulence can then be developed by utilizing these computed transfer functions together with the disturbance frequencies of interest.

  2. Parallel simulation today

    NASA Technical Reports Server (NTRS)

    Nicol, David; Fujimoto, Richard

    1992-01-01

    This paper surveys topics that presently define the state of the art in parallel simulation. Included in the tutorial are discussions on new protocols, mathematical performance analysis, time parallelism, hardware support for parallel simulation, load balancing algorithms, and dynamic memory management for optimistic synchronization.

  3. Research in parallel computing

    NASA Technical Reports Server (NTRS)

    Ortega, James M.; Henderson, Charles

    1994-01-01

    This report summarizes work on parallel computations for NASA Grant NAG-1-1529 for the period 1 Jan. - 30 June 1994. Short summaries on highly parallel preconditioners, target-specific parallel reductions, and simulation of delta-cache protocols are provided.

  4. Implementation and performance of parallel Prolog interpreter

    SciTech Connect

    Wei, S.; Kale, L.V.; Balkrishna, R. . Dept. of Computer Science)

    1988-01-01

    In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE-OR process model which exploits both AND and OR parallelism in logic programs. It is machine independent as it runs on top of the chare kernel, a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark programs on parallel machines including shared memory systems (an Alliant FX/8, a Sequent, and a MultiMax) and a non-shared memory system (an Intel iPSC/32 hypercube), in addition to its performance on a multiprocessor simulation system.

  5. Parallel computing for probabilistic fatigue analysis

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.

    1993-01-01

    This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism, to keep a large number of processors busy, and to treat problems with the large memory requirements encountered in practice. We also conclude that a distributed-memory architecture is preferable to shared memory for achieving large-scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.

  6. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1991-01-01

    The main contribution of the effort in the last two years is the introduction of the MOPPS system. After an extensive literature search, we introduced the system, which is described next. MOPPS employs a new solution to the problem of managing programs which solve scientific and engineering applications in a distributed processing environment. Autonomous computers cooperate efficiently in solving large scientific problems with this solution. MOPPS has the advantage of not assuming the presence of any particular network topology or configuration, computer architecture, or operating system. It imposes little overhead on network and processor resources while efficiently managing programs concurrently. The core of MOPPS is an intelligent program manager that builds a knowledge base of the execution performance of the parallel programs it is managing under various conditions. The manager applies this knowledge to improve the performance of future runs. The program manager learns from experience.

  7. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2015-11-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fixed-boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.

  8. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2014-10-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.

  9. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2013-10-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Preliminary results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.

  10. Computer-Aided Parallelizer and Optimizer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

  11. Merlin - Massively parallel heterogeneous computing

    NASA Technical Reports Server (NTRS)

    Wittie, Larry; Maples, Creve

    1989-01-01

    Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.

  12. Parallel Computing in SCALE

    SciTech Connect

    DeHart, Mark D; Williams, Mark L; Bowman, Stephen M

    2010-01-01

    The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than is possible with the individual programs, allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer-term SCALE enhancement is also being developed.

  13. Parallel algorithm development

    SciTech Connect

    Adams, T.F.

    1996-06-01

    Rapid changes in parallel computing technology are causing significant changes in the strategies being used for parallel algorithm development. One approach is simply to write computer code in a standard language like FORTRAN 77 with the expectation that the compiler will produce executable code that will run in parallel. The alternatives are: (1) to build explicit message passing directly into the source code; or (2) to write source code without explicit reference to message passing or parallelism, but use a general communications library to provide efficient parallel execution. Application of these strategies is illustrated with examples of codes currently under development.

  14. Parallel Computational Protein Design

    PubMed Central

    Zhou, Yichao; Donald, Bruce R.; Zeng, Jianyang

    2016-01-01

    Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that guarantees finding the global minimum energy conformation (GMEC) is to combine the dead-end elimination (DEE) and A* tree search algorithms. However, in this framework, the A* search algorithm can run in exponential time in the worst case, which may become the computational bottleneck of a large-scale computational protein design process. To address this issue, we extend and add a new module to the OSPREY program that was previously developed in the Donald lab [1] to implement a GPU-based massively parallel A* algorithm for improving the protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the heuristic function for A* search, our new program, called gOSPREY, can provide up to four orders of magnitude speedup in large protein design cases with a small memory overhead compared to the traditional A* search algorithm implementation, while still guaranteeing optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle problems in which the conformation space is too large and the global optimal solution could not be computed previously. Furthermore, the GPU-based A* algorithm implemented in the gOSPREY program can be combined with state-of-the-art rotamer pruning algorithms such as iMinDEE [2] and DEEPer [3] to also consider continuous backbone and side-chain flexibility. PMID:27914056
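
    For readers unfamiliar with the search component, the textbook A* loop looks like the following sequential, hypothetical Python sketch (the paper's contribution, a GPU-parallel A* specialized to protein conformation trees, is not reproduced here):

        import heapq
        from itertools import count

        def astar(start, goal, neighbors, h):
            # neighbors(n) -> iterable of (next_node, edge_cost); h admissible.
            tie = count()                         # breaks heap ties between equal f
            frontier = [(h(start), next(tie), 0.0, start)]
            best_g = {start: 0.0}
            while frontier:
                f, _, g, node = heapq.heappop(frontier)
                if node == goal:
                    return g                      # optimal: h never overestimates
                if g > best_g.get(node, float("inf")):
                    continue                      # stale queue entry
                for nxt, cost in neighbors(node):
                    g2 = g + cost
                    if g2 < best_g.get(nxt, float("inf")):
                        best_g[nxt] = g2
                        heapq.heappush(frontier, (g2 + h(nxt), next(tie), g2, nxt))
            return None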

  15. Parallel Atomistic Simulations

    SciTech Connect

    HEFFELFINGER,GRANT S.

    2000-01-18

    Algorithms developed to enable the use of atomistic molecular simulation methods with parallel computers are reviewed. Methods appropriate for bonded as well as non-bonded (and charged) interactions are included. While strategies for obtaining parallel molecular simulations have been developed for the full variety of atomistic simulation methods, molecular dynamics and Monte Carlo have received the most attention. Three main types of parallel molecular dynamics simulations have been developed: the replicated data decomposition, the spatial decomposition, and the force decomposition. For Monte Carlo simulations, parallel algorithms have been developed which can be divided into two categories, those which require a modified Markov chain and those which do not. Parallel algorithms developed for other simulation methods such as Gibbs ensemble Monte Carlo, grand canonical molecular dynamics, and Monte Carlo methods for protein structure determination are also reviewed, and issues such as how to measure parallel efficiency, especially in the case of parallel Monte Carlo algorithms with modified Markov chains, are discussed.

  16. Parallel Logic Programming and Parallel Systems Software and Hardware

    DTIC Science & Technology

    1989-07-29

  17. Parallel Computing Strategies for Irregular Algorithms

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

    2002-01-01

    Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.

  18. Parallel methods for the flight simulation model

    SciTech Connect

    Xiong, Wei Zhong; Swietlik, C.

    1994-06-01

    The Advanced Computer Applications Center (ACAC) has been involved in evaluating advanced parallel architecture computers and the applicability of these machines to computer simulation models. The advanced systems investigated include parallel machines with shared-memory and distributed architectures, consisting of an eight-processor Alliant FX/8, a twenty-four-processor Sequent Symmetry, a Cray X-MP, an IBM RISC 6000 model 550, and the Intel Touchstone eight-processor Gamma and 512-processor Delta machines. Since parallelizing a truly efficient application program for a parallel machine is a difficult task, implementation for these machines in a realistic setting has been largely overlooked. The ACAC has developed considerable expertise in optimizing and parallelizing application models on a collection of advanced multiprocessor systems. One such application model is the Flight Simulation Model, which uses a set of differential equations to describe the flight characteristics of a launched missile by means of a trajectory. The Flight Simulation Model was written in the FORTRAN language with approximately 29,000 lines of source code. Depending on the number of trajectories, the computation can require several hours to a full day of CPU time on a DEC/VAX 8650 system. There is an impetus to reduce the execution time and utilize the advanced parallel architecture computing environment available. ACAC researchers developed a parallel method that allows the Flight Simulation Model to run in parallel on the multiprocessor system. For the benchmark data tested, the parallel Flight Simulation Model implemented on the Alliant FX/8 has achieved nearly linear speedup. In this paper, we describe a parallel method for the Flight Simulation Model. We believe the method presented here provides a general concept for the design of parallel applications. This concept, in most cases, can be adapted to many other sequential application programs.
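
    Because the trajectories are mutually independent, this coarse-grained decomposition maps naturally onto a process pool. A hypothetical Python sketch of our own (fly stands in for the FORTRAN trajectory integrator):

        from multiprocessing import Pool

        def fly(params):
            # Stand-in for integrating one missile trajectory from its parameters.
            x, v, dt = 0.0, params["v0"], 0.01
            for _ in range(1000):
                x, v = x + v * dt, v - 9.81 * dt
            return x

        if __name__ == "__main__":
            cases = [{"v0": 100.0 + i} for i in range(64)]  # independent runs
            with Pool() as pool:
                ranges = pool.map(fly, cases)  # near-linear speedup if fly dominates
            print(max(ranges))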

  19. Recent Improvements to HST Parallel Scheduling

    NASA Astrophysics Data System (ADS)

    Henry, Ronald; Butschky, Mike

    The Hubble Space Telescope (HST) has several scientific instruments (SIs) that may be used at any given time. Most primary visits submitted by HST observers only use one SI, leaving the other SIs free to be requested by ``pure parallel'' observing programs. In order to accomplish this, separate scheduling units (SUs) for each parallel SI must be created and then scheduled by the Science Planning and Scheduling System (SPSS), taking into account numerous orbital and scientific constraints. The Parallel Observation Matching System (POMS) has the task of matching parallel visits to primary observations and ``crafting'' appropriate parallel SUs at each opportunity, taking scientific criteria and orbital constraints into account. The process for planning and scheduling parallel observations is thus quite different from the process for primary science. In the past, custom crafting rules for each parallel program were necessary, requiring full-time support from a software developer. In addition, because POMS ran as a standalone system, its ability to model how long parallel SUs would take was limited, especially with the flexible buffer-management schemes used for the second-generation SIs. A new version of POMS was developed in 1997. This version uses a formal proposal syntax (the same used for primary observations) for parallels, so that different proposals can be handled uniformly and without the need for customized ``crafting rules.'' In addition, POMS is integrated with the Transformation (TRANS) planning system in order to give it full knowledge of overheads within an SU, eliminating the need for ad hoc modeling. The power and versatility of this approach have paid off in improved utilization of parallel opportunities, greatly reduced maintenance costs, and an ability to gracefully handle new parallel proposals and new SIs with minimal software effort. This paper discusses the requirements, design, and operational results of the new POMS.

  1. An Evaluation of Kernel Equating: Parallel Equating with Classical Methods in the SAT Subject Tests[TM] Program. Research Report. ETS RR-09-06

    ERIC Educational Resources Information Center

    Grant, Mary C.; Zhang, Lilly; Damiano, Michele

    2009-01-01

    This study investigated kernel equating methods by comparing these methods to operational equatings for two tests in the SAT Subject Tests[TM] program. GENASYS (ETS, 2007) was used for all equating methods and scaled score kernel equating results were compared to Tucker, Levine observed score, chained linear, and chained equipercentile equating…

  2. Partitioning problems in parallel, pipelined and distributed computing

    NASA Technical Reports Server (NTRS)

    Bokhari, S.

    1985-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.
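
    The flavor of these partitioning problems is easy to convey with a small dynamic program. The hypothetical Python sketch below solves the classic contiguous-chain bottleneck problem (split a chain of module weights into k contiguous groups, one per processor, minimizing the heaviest group); it is a simple illustration, not Bokhari's Sum-Bottleneck path algorithm itself.

        from functools import lru_cache

        def chain_partition(weights, k):
            # Minimum bottleneck over all ways to cut the chain into k
            # contiguous groups (one group per processor).
            n = len(weights)
            prefix = [0]
            for w in weights:
                prefix.append(prefix[-1] + w)

            @lru_cache(maxsize=None)
            def best(i, parts):
                if parts == 1:
                    return prefix[n] - prefix[i]       # rest goes to one processor
                return min(max(prefix[j] - prefix[i],  # load of the first group
                               best(j, parts - 1))     # bottleneck of the rest
                           for j in range(i + 1, n - parts + 2))

            return best(0, k)

        print(chain_partition([3, 1, 4, 1, 5, 9, 2, 6], 3))   # -> 14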

  3. Partitioning problems in parallel, pipelined, and distributed computing

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.

    1988-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple-computer system is addressed. A sum-bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple-satellite system: partitioning multiple chain-structured parallel programs, multiple arbitrarily structured serial programs, and single-tree structured parallel programs. In addition, the problem of partitioning chain-structured parallel programs across chain-connected systems is solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple-computer architectures for a wide range of problems of practical interest.

  4. Parallel digital forensics infrastructure.

    SciTech Connect

    Liebrock, Lorie M.; Duggan, David Patrick

    2009-10-01

    This report documents the architecture and implementation of a parallel digital forensics (PDF) infrastructure. This infrastructure is necessary for supporting the design, implementation, and testing of new classes of parallel digital forensics tools. Digital forensics has become extremely difficult with data sets of one terabyte and larger. The only way to overcome the processing time of these large sets is to identify and develop new parallel algorithms for performing the analysis. To support algorithm research, a flexible base infrastructure is required. A candidate architecture for this base infrastructure was designed, instantiated, and tested by this project, in collaboration with New Mexico Tech. Previous infrastructures were not designed and built specifically for the development and testing of parallel algorithms. With the size of forensics data sets only expected to increase significantly, this type of infrastructure support is necessary for continued research in parallel digital forensics.

  5. Parallel, Distributed Scripting with Python

    SciTech Connect

    Miller, P J

    2002-05-24

    Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI library gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc. These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider a password checker that checks an encrypted password against a 25,000-word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and coordinate the work.
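
    A dictionary check of this kind parallelizes by splitting the word list across worker processes. A minimal hypothetical sketch of ours (it uses a SHA-256 digest in place of a real crypt(3) hash):

        import hashlib
        from multiprocessing import Pool

        TARGET = hashlib.sha256(b"hunter2").hexdigest()  # the "encrypted" password

        def check_chunk(words):
            # Each worker scans its own slice of the dictionary independently.
            for w in words:
                if hashlib.sha256(w.encode()).hexdigest() == TARGET:
                    return w
            return None

        def crack(dictionary, workers=4):
            chunks = [dictionary[i::workers] for i in range(workers)]
            with Pool(workers) as pool:
                for hit in pool.map(check_chunk, chunks):
                    if hit is not None:
                        return hit
            return None

        if __name__ == "__main__":
            words = ["password", "letmein", "qwerty", "hunter2"] * 6250
            print(crack(words))                          # -> hunter2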

  6. Parallel LU-factorization algorithms for dense matrices

    SciTech Connect

    Oppe, T.C.; Kincaid, D.R.

    1987-05-01

    Several serial and parallel algorithms for computing the LU-factorization of a dense matrix are investigated. Numerical experiments and programming considerations to reduce bank conflicts on the Cray X-MP4 parallel computer are presented. Speedup factors are given for the parallel algorithms. 15 refs., 6 tabs.

  7. Automatic Management of Parallel and Distributed System Resources

    NASA Technical Reports Server (NTRS)

    Yan, Jerry; Ngai, Tin Fook; Lundstrom, Stephen F.

    1990-01-01

    Viewgraphs on automatic management of parallel and distributed system resources are presented. Topics covered include: parallel applications; intelligent management of multiprocessing systems; performance evaluation of parallel architecture; dynamic concurrent programs; compiler-directed system approach; lattice gaseous cellular automata; and sparse matrix Cholesky factorization.

  8. Introduction to Parallel Computing

    DTIC Science & Technology

    1992-05-01

    As parallel machines become the wave of the present, tools are increasingly needed to assist programmers in creating parallel tasks and coordinating their activities. Linda was designed to be such a tool. Linda was designed with three important goals in mind: to be portable, efficient, and easy to use.

  9. Parallel Wolff Cluster Algorithms

    NASA Astrophysics Data System (ADS)

    Bae, S.; Ko, S. H.; Coddington, P. D.

    The Wolff single-cluster algorithm is the most efficient method known for Monte Carlo simulation of many spin models. Due to the irregular size, shape and position of the Wolff clusters, this method does not easily lend itself to efficient parallel implementation, so that simulations using this method have thus far been confined to workstations and vector machines. Here we present two parallel implementations of this algorithm, and show that one gives fairly good performance on a MIMD parallel computer.

  10. PCLIPS: Parallel CLIPS

    NASA Technical Reports Server (NTRS)

    Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan

    1994-01-01

    A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to/from remote nodes' working memory. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also been used to run several benchmark parallel knowledge bases such as one to set up a cafeteria. Results from running Parallel CLIPS with parallel knowledge-base partitions indicate that significant speed increases, including superlinear speedups in some cases, are possible.

  11. Parallel Algorithms and Patterns

    SciTech Connect

    Robey, Robert W.

    2016-06-16

    This is a PowerPoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation; the topic really deserves its own detailed discussion, which Gabe Rockefeller would like to develop.
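
    Of the patterns named, the prefix scan is the least obvious to parallelize. The hypothetical Python sketch below renders the classic step-doubling (Hillis-Steele) scheme, with the inner loop marking where a parallel machine would update all elements at once; the result is an inclusive prefix sum in O(log n) parallel steps.

        def inclusive_scan(xs):
            a = list(xs)
            step = 1
            while step < len(a):
                # On a parallel machine every element updates simultaneously;
                # reading from the snapshot `prev` models that lock-step.
                prev = a[:]
                for i in range(step, len(a)):   # this loop is the parallel part
                    a[i] = prev[i - step] + prev[i]
                step *= 2                       # O(log n) doubling steps in all
            return a

        print(inclusive_scan([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]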

  12. Interfacing Computer Aided Parallelization and Performance Analysis

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)

    2003-01-01

    When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.

  13. CTRANS: A Monte Carlo program for radiative transfer in plane parallel atmospheres with imbedded finite clouds: Development, testing and user's guide

    NASA Technical Reports Server (NTRS)

    1976-01-01

    The CTRANS program, which was designed to perform radiative transfer computations in an atmosphere with horizontal inhomogeneities (clouds), is described. Since the atmosphere-ground system was to be richly detailed, the Monte Carlo method was employed. This means that results are obtained through direct modeling of the physical process of radiative transport. The effects of atmospheric or ground albedo pattern detail are essentially built up from their impact upon the transport of individual photons. The CTRANS program actually tracks the photons backwards through the atmosphere, initiating them at a receiver and following them backwards along their path to the Sun. The pattern of incident photons generated through backwards tracking automatically reflects the importance to the receiver of each region of the sky. Further, through backwards tracking, the impact of the finite field of view of the receiver and variations in its response over the field of view can be directly simulated.

  14. Parallel Performance of a Combustion Chemistry Simulation

    DOE PAGES

    Skinner, Gregg; Eigenmann, Rudolf

    1995-01-01

    We used a description of a combustion simulation's mathematical and computational methods to develop a version for parallel execution. The result was a reasonable performance improvement on small numbers of processors. We applied several important programming techniques, which we describe, in optimizing the application. This work has implications for programming languages, compiler design, and software engineering.

  15. On the parallelization of molecular dynamics codes

    NASA Astrophysics Data System (ADS)

    Trabado, G. P.; Plata, O.; Zapata, E. L.

    2002-08-01

    Molecular dynamics (MD) codes present a high degree of spatial data locality and a significant amount of independent computations. However, most parallelization strategies are based on the manual transformation of sequential programs, either by completely rewriting the code with message passing routines or by using specific libraries intended for writing new MD programs. In this paper we propose a new library-based approach (DDLY) which supports the parallelization of existing short-range MD sequential codes. The novelty of this approach is that it can directly handle the distribution of common data structures used in MD codes to represent data (arrays, Verlet lists, link cells), using domain decomposition. Thus, the insertion of run-time support for distribution and communication in an MD program does not imply significant changes to its structure. The method is simple, efficient and portable. It may also be used to extend existing parallel programming languages, such as HPF.
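
    A link cell structure, one of the data structures the library distributes, takes only a few lines to sketch. The hypothetical Python illustration below (the standard serial technique, not DDLY itself) bins particles into cells at least as wide as the cutoff, so a neighbor search examines only adjacent cells rather than all pairs.

        from collections import defaultdict
        from itertools import product

        def neighbor_pairs(pos, cutoff):
            # All pairs (i, j), i < j, closer than cutoff; 2-D, non-periodic.
            cells = defaultdict(list)
            for i, (x, y) in enumerate(pos):
                cells[(int(x // cutoff), int(y // cutoff))].append(i)
            pairs = []
            for (cx, cy), members in cells.items():
                for dx, dy in product((-1, 0, 1), repeat=2):  # cell + 8 neighbors
                    for j in cells.get((cx + dx, cy + dy), ()):
                        for i in members:
                            if i < j:                         # count each pair once
                                (x1, y1), (x2, y2) = pos[i], pos[j]
                                if (x1 - x2) ** 2 + (y1 - y2) ** 2 < cutoff ** 2:
                                    pairs.append((i, j))
            return pairs

        print(neighbor_pairs([(0.1, 0.1), (0.3, 0.2), (2.5, 2.5)], 0.5))
        # -> [(0, 1)]; the far particle is never even compared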

  16. Parallelization of a Compositional Reservoir Simulator

    NASA Astrophysics Data System (ADS)

    Reme, Hilde; Åge Øye, Geir; Espedal, Magne S.; Fladmark, Gunnar E.

    A finite volume discretization has been used to solve compositional flow in porous media. Secondary migration in fractured rocks has been the main motivation for the work. Multipoint flux approximation has been implemented, and adaptive local grid refinement, based on domain decomposition, is used at fractures and faults. The parallelization method, which is described in this paper, strongly promotes code reuse and gives a very high level of parallelization despite low implementation costs. The programming framework is also portable to other platforms or other applications. We have presented computer experiments to examine the parallel efficiency of the implemented parallel simulator with respect to scalability and speedup. Keywords: porous media, multipoint flux approximation, domain decomposition, parallelization

  17. Parallels in History.

    ERIC Educational Resources Information Center

    Mugleston, William F.

    2000-01-01

    Believes that by focusing on the recurrent situations and problems, or parallels, throughout history, students will understand the relevance of history to their own times and lives. Provides suggestions for parallels in history that may be introduced within lectures or as a means to class discussions. (CMK)

  18. Design and Analysis of Scalable Parallel Algorithms

    DTIC Science & Technology

    1993-11-15

  19. Predicting Protein Structure Using Parallel Genetic Algorithms.

    DTIC Science & Technology

    1994-12-01

  20. Massively parallel mathematical sieves

    SciTech Connect

    Montry, G.R.

    1989-01-01

    The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
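
    The decomposition idea carries over directly to a segmented sieve. In the hypothetical Python sketch below (ours, not the hypercube code), base primes up to the square root of N are found serially, and each worker then sieves its own segment of the range independently.

        import math
        from multiprocessing import Pool

        def small_primes(limit):
            # Serial Sieve of Eratosthenes up to limit.
            flags = bytearray([1]) * (limit + 1)
            flags[0:2] = b"\x00\x00"
            for p in range(2, math.isqrt(limit) + 1):
                if flags[p]:
                    flags[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
            return [i for i, f in enumerate(flags) if f]

        def sieve_segment(args):
            # Mark composites in [lo, hi) using the shared base primes.
            lo, hi, primes = args
            flags = bytearray([1]) * (hi - lo)
            for p in primes:
                start = max(p * p, ((lo + p - 1) // p) * p)
                for m in range(start, hi, p):
                    flags[m - lo] = 0
            return [lo + i for i, f in enumerate(flags) if f]

        def parallel_sieve(n, workers=4):
            primes = small_primes(math.isqrt(n))
            edges = [2 + (n - 1) * k // workers for k in range(workers + 1)]
            segs = [(edges[k], edges[k + 1], primes) for k in range(workers)]
            with Pool(workers) as pool:
                return [p for seg in pool.map(sieve_segment, segs) for p in seg]

        if __name__ == "__main__":
            print(parallel_sieve(100))   # the 25 primes below 100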

  1. Totally parallel multilevel algorithms

    NASA Technical Reports Server (NTRS)

    Frederickson, Paul O.

    1988-01-01

    Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT-based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which is referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.

  2. Code Parallelization with CAPO: A User Manual

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)

    2001-01-01

    A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software, in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This will demonstrate the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experience with this toolkit.

  3. NESL: A nested data-parallel language (version 2. 6)

    SciTech Connect

    Blelloch, G.E.

    1993-04-01

    This report describes NESL, a strongly-typed, applicative, data-parallel language. NESL is intended to be used as a portable interface for programming a variety of parallel and vector supercomputers, and as a basis for teaching parallel algorithms. Parallelism is supplied through a simple set of data-parallel constructs based on sequences (ordered sets), including a mechanism for applying any function over the elements of a sequence in parallel and a rich set of parallel functions that manipulate sequences. NESL fully supports nested sequences and nested parallelism, that is, the ability to take a parallel function and apply it over multiple instances in parallel. Nested parallelism is important for implementing algorithms with complex and dynamically changing data structures, such as required in many graph and sparse matrix algorithms. NESL also provides a mechanism for calculating the asymptotic running time for a program on various parallel machine models, including the parallel random access machine (PRAM). This is useful for estimating running times of algorithms on actual machines and, when teaching algorithms, for supplying a close correspondence between the code and the theoretical complexity.

  4. Supercomputer Programming Environments.

    DTIC Science & Technology

    1987-10-30

    Further, the design of parallel algorithms is still a relatively new art. Multiprocessing leads to other difficulties as well. State-of-the-art performance evaluation tools for existing single-processor machines are helpful when extended into the parallel domain, but remain too ...

  5. The NAS parallel benchmarks

    NASA Technical Reports Server (NTRS)

    Bailey, David (Editor); Barton, John (Editor); Lasinski, Thomas (Editor); Simon, Horst (Editor)

    1993-01-01

    A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels,' and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

  6. PISCES: An environment for parallel scientific computation

    NASA Technical Reports Server (NTRS)

    Pratt, T. W.

    1985-01-01

    The parallel implementation of scientific computing environment (PISCES) is a project to provide high-level programming environments for parallel MIMD computers. Pisces 1, the first of these environments, is a FORTRAN 77 based environment which runs under the UNIX operating system. The Pisces 1 user programs in Pisces FORTRAN, an extension of FORTRAN 77 for parallel processing. The major emphasis in the Pisces 1 design is in providing a carefully specified virtual machine that defines the run-time environment within which Pisces FORTRAN programs are executed. Each implementation then provides the same virtual machine, regardless of differences in the underlying architecture. The design is intended to be portable to a variety of architectures. Currently Pisces 1 is implemented on a network of Apollo workstations and on a DEC VAX uniprocessor via simulation of the task level parallelism. An implementation for the Flexible Computing Corp. FLEX/32 is under construction. An introduction to the Pisces 1 virtual computer and the FORTRAN 77 extensions is presented. An example of an algorithm for the iterative solution of a system of equations is given. The most notable features of the design are the provision for several granularities of parallelism in programs and the provision of a window mechanism for distributed access to large arrays of data.

  7. Highly parallel sparse Cholesky factorization

    NASA Technical Reports Server (NTRS)

    Gilbert, John R.; Schreiber, Robert

    1990-01-01

    Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.
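
    The pattern highlighted here - several independent dense factorizations proceeding simultaneously - can be sketched in Python. The block-diagonal test matrices, numpy, and a process pool below are illustrative stand-ins, not the Connection Machine implementation:

        # Illustrative sketch: factor many independent dense blocks in
        # parallel, mimicking dense Cholesky factorizations running
        # simultaneously inside a sparse factorization.
        from concurrent.futures import ProcessPoolExecutor
        import numpy as np

        def make_spd(n, seed):
            rng = np.random.default_rng(seed)
            a = rng.standard_normal((n, n))
            return a @ a.T + n * np.eye(n)   # symmetric positive definite

        def chol(block):
            return np.linalg.cholesky(block)

        if __name__ == "__main__":
            blocks = [make_spd(100, s) for s in range(8)]
            with ProcessPoolExecutor() as pool:
                factors = list(pool.map(chol, blocks))
            # Each factor L satisfies L @ L.T == block up to round-off.
            print(max(np.linalg.norm(L @ L.T - B)
                      for L, B in zip(factors, blocks)))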

  8. Effect of Combined Systematized Behavioral Modification Education Program With Desmopressin in Patients With Nocturia: A Prospective, Multicenter, Randomized, and Parallel Study

    PubMed Central

    Cho, Sung Yong; Lee, Kyu-Sung; Kim, Jang Hwan; Seo, Ju Tae; Choo, Myung-Soo; Kim, Joon Chul; Choi, Jong Bo; Song, Miho; Chun, Ji-Youn

    2014-01-01

    Purpose The aims of this study were to investigate the efficacy of combining the systematized behavioral modification program (SBMP) with desmopressin therapy and to compare this with desmopressin monotherapy in the treatment of nocturnal polyuria (NPU). Methods Patients were randomized at 8 centers to receive desmopressin monotherapy (group A) or combination therapy, comprising desmopressin and the SBMP (group B). Nocturia was defined as an average of 2 or more nightly voids. The primary endpoint was a change in the mean number of nocturnal voids from baseline during the 3-month treatment period. The secondary endpoints were changes in the bladder diary parameters and questionnaire scores, and improvements in self-perception for nocturia. Results A total of 200 patients were screened and 76 were excluded from the study because they failed the screening process. A total of 124 patients were randomized to receive treatment, with group A comprising 68 patients and group B comprising 56 patients. The patients' characteristics were similar between the groups. Nocturnal voids showed a greater decline in group B (-1.5) compared with group A (-1.2), a difference that was not statistically significant. Significant differences were observed between groups A and B with respect to the NPU index (0.37 vs. 0.29, P=0.028), the change in the maximal bladder capacity (-41.3 mL vs. 13.3 mL, P<0.001), and the rate of patients lost to follow-up (10.3% [7/68] vs. 0% [0/56], P=0.016). Self-perception for nocturia significantly improved in both groups. Conclusions Combination treatment did not have any additional benefits in relation to reducing nocturnal voids in patients with NPU; however, combination therapy is helpful because it increases the maximal bladder capacity and decreases the NPU index. Furthermore, combination therapy increased the persistence of desmopressin treatment in patients with NPU. PMID:25558419

  9. The Parallel Axiom

    ERIC Educational Resources Information Center

    Rogers, Pat

    1972-01-01

    Criteria for a reasonable axiomatic system are discussed. A discussion of the historical attempts to prove the independence of Euclid's parallel postulate introduces non-Euclidean geometries. Poincare's model for a non-Euclidean geometry is defined and analyzed. (LS)

  10. Parallels with nature

    NASA Astrophysics Data System (ADS)

    2014-10-01

    Adam Nelson and Stuart Warriner, from the University of Leeds, talk with Nature Chemistry about their work to develop viable synthetic strategies for preparing new chemical structures in parallel with the identification of desirable biological activity.

  12. Scalable parallel communications

    NASA Technical Reports Server (NTRS)

    Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

    1992-01-01

    Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, while other TCPs running in parallel provide high bandwidth).

  13. Gang scheduling a parallel machine. Revision 1

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III

    1991-12-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processes. User programs and their gangs of processes are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quanta are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory.
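
    A toy version of the mechanism - suspending and resuming whole gangs of processes so that one gang at a time owns the machine - might look like the following POSIX-only sketch; the two gangs, four workers each, and the 0.5 s quantum are invented for illustration:

        # Toy gang scheduler: each gang is a list of worker processes that are
        # stopped and resumed together with SIGSTOP/SIGCONT, so exactly one
        # gang runs during each time quantum (POSIX only; illustrative).
        import os, signal, time
        from multiprocessing import Process

        def worker(gang_id):
            while True:
                time.sleep(0.1)  # stands in for real computation

        def signal_gang(gang, sig):
            for p in gang:
                os.kill(p.pid, sig)

        if __name__ == "__main__":
            gangs = [[Process(target=worker, args=(g,)) for _ in range(4)]
                     for g in range(2)]
            for gang in gangs:
                for p in gang:
                    p.start()
                signal_gang(gang, signal.SIGSTOP)   # all gangs start suspended
            for quantum in range(6):                # round-robin time slices
                gang = gangs[quantum % len(gangs)]
                signal_gang(gang, signal.SIGCONT)   # wake the whole gang
                time.sleep(0.5)                     # the time quantum
                signal_gang(gang, signal.SIGSTOP)   # put it back to sleep
            for gang in gangs:
                signal_gang(gang, signal.SIGKILL)
                for p in gang:
                    p.join()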

  14. Object-oriented parallel polygon rendering

    SciTech Connect

    Heiland, R.W.

    1994-09-01

    Since many scientific datasets can be visualized using some polygonal representation, a polygon renderer has broad use for scientific visualization. With today's high performance computing applications producing very large datasets, a parallel polygon renderer is a necessary tool for keeping the compute-visualize cycle at a minimum. This paper presents a polygon renderer that combines the shared-memory and message-passing models of parallel programming. It uses the Global Arrays library, a shared-memory programming toolkit for distributed memory machines. The experience of using an object-oriented approach for software design and development is also discussed.

  15. Revisiting and parallelizing SHAKE

    NASA Astrophysics Data System (ADS)

    Weinbach, Yael; Elber, Ron

    2005-10-01

    An algorithm is presented for running SHAKE in parallel. SHAKE is a widely used approach to compute molecular dynamics trajectories with constraints. An essential step in SHAKE is the solution of a sparse linear problem of the type Ax = b, where x is a vector of unknowns. Conjugate gradient minimization (that can be done in parallel) replaces the widely used iteration process that is inherently serial. Numerical examples present good load balancing and are limited only by communication time.
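
    The conjugate gradient kernel itself is standard; a minimal dense-matrix version is sketched below, with numpy assumed. In a parallel SHAKE the matrix is sparse, and the A @ p products are the step distributed across processors:

        # Minimal conjugate gradient for A x = b with symmetric positive
        # definite A; the matrix-vector products are the parallelizable step.
        import numpy as np

        def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
            x = np.zeros_like(b)
            r = b - A @ x
            p = r.copy()
            rs = r @ r
            for _ in range(max_iter):
                Ap = A @ p
                alpha = rs / (p @ Ap)
                x += alpha * p
                r -= alpha * Ap
                rs_new = r @ r
                if np.sqrt(rs_new) < tol:
                    break
                p = r + (rs_new / rs) * p
                rs = rs_new
            return x

        A = np.array([[4.0, 1.0], [1.0, 3.0]])
        b = np.array([1.0, 2.0])
        print(conjugate_gradient(A, b))  # approx [0.0909, 0.6364]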

  16. Parallel image compression

    NASA Technical Reports Server (NTRS)

    Reif, John H.

    1987-01-01

    A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless text compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.
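
    Vector quantization, one of the two ingredients, can be illustrated by training a small k-means codebook over image blocks; the 2x2 blocks, random data, and numpy training loop below are assumptions made for the sketch, not the MPP algorithm:

        # Toy vector quantization: learn a small codebook over 2x2 image
        # blocks with a few k-means iterations, then compress each block to
        # the index of its nearest codeword.
        import numpy as np

        rng = np.random.default_rng(0)
        blocks = rng.integers(0, 256, size=(500, 4)).astype(float)  # 2x2 blocks

        k = 8
        codebook = blocks[rng.choice(len(blocks), k, replace=False)]
        for _ in range(10):                     # k-means iterations
            d = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)           # nearest codeword per block
            for j in range(k):
                members = blocks[assign == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)

        compressed = assign                     # one small index per block
        print(compressed[:10], codebook.shape)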

  17. Parallel strategies for SAR processing

    NASA Astrophysics Data System (ADS)

    Segoviano, Jesus A.

    2004-12-01

    This article proposes a series of strategies for improving the computer processing of the Synthetic Aperture Radar (SAR) signal, following the three usual lines of action for speeding up the execution of any computer program: optimizing the data structures, optimizing the application architecture, and improving the hardware. For the first two, the data structures usually employed in SAR processing are studied and parallel alternatives are proposed, together with the way the parallelization of the algorithms employed in the process is implemented; in the parallel application architecture, processes are classified as fine or coarse grain and are assigned to individual processors or divided among processors, each in its corresponding architecture. For the hardware, the platforms on which the SAR process is implemented are studied, namely shared-memory multiprocessors and distributed-memory multicomputers; a comparison between them gives some guidelines to follow in order to get maximum throughput with minimum latency and maximum effectiveness with minimum cost, all together with limited complexity. It is concluded that processing the algorithms in a GNU/Linux environment, together with a Beowulf cluster platform, offers, under certain conditions, the best compromise between performance and cost, and promises the greatest development in the coming years for the computer-power-thirsty Synthetic Aperture Radar applications.

  18. A parallel PCG solver for MODFLOW.

    PubMed

    Dong, Yanhui; Li, Guomin

    2009-01-01

    In order to simulate large-scale ground water flow problems more efficiently with MODFLOW, the OpenMP programming paradigm was used in this study to parallelize the preconditioned conjugate-gradient (PCG) solver. Incremental parallelization, the significant advantage supported by OpenMP on a shared-memory computer, allowed the solver to be turned into a parallel program smoothly, one block of code at a time. The parallel PCG solver, suitable for both MODFLOW-2000 and MODFLOW-2005, was verified using an 8-processor computer. Both the impact of compilers and different model domain sizes were considered in the numerical experiments. Based on the timing results, execution times using the parallel PCG solver are typically about 1.40 to 5.31 times faster than those using the serial one. In addition, the simulation results are identical to those of the original PCG solver, because the majority of the serial code was not changed. It is worth noting that this parallelizing approach reduces cost in terms of software maintenance, because only a single source PCG solver code needs to be maintained in the MODFLOW source tree.

  19. Parallel processing of a rotating shaft simulation

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.

    1989-01-01

    A FORTRAN program describing the vibration modes of a rotor-bearing system is analyzed for parallelism; the simulation is expressed in a Pascal-like structured language. Potential vector operations are also identified. A critical path through the simulation is identified and used in conjunction with somewhat fictitious processor characteristics to determine the time to calculate the problem on a parallel processing system having those characteristics. A parallel processing overhead time is included as a parameter for proper evaluation of the gain over serial calculation. The serial calculation time is determined for the same fictitious system. An improvement of up to 640 percent is possible depending on the value of the overhead time. Based on the analysis, certain conclusions are drawn pertaining to the development needs of parallel processing technology, and to the specification of parallel processing systems to meet computational needs.
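
    The kind of estimate described - serial time against critical-path time plus a parallel-processing overhead parameter - reduces to a one-line calculation. The numbers below are invented, chosen only to show how a figure such as 640 percent can arise:

        # Speedup estimate from a critical-path analysis: parallel time is the
        # critical path plus a per-run overhead; all numbers are illustrative.
        t_serial = 64.0          # serial calculation time (ms)
        t_critical_path = 8.0    # longest dependent chain of calculations (ms)

        for overhead in (0.0, 2.0, 8.0, 24.0):
            speedup = t_serial / (t_critical_path + overhead)
            print(f"overhead={overhead:5.1f} ms  speedup={speedup:4.1f}x "
                  f"({speedup * 100:.0f} percent of the serial rate)")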

  20. Visualization and Tracking of Parallel CFD Simulations

    NASA Technical Reports Server (NTRS)

    Vaziri, Arsi; Kremenetsky, Mark

    1995-01-01

    We describe a system for interactive visualization and tracking of a 3-D unsteady computational fluid dynamics (CFD) simulation on a parallel computer. CM/AVS, a distributed, parallel implementation of the visualization environment AVS, runs on the CM-5 parallel supercomputer. A CFD solver is run as a CM/AVS module on the CM-5. Data communication between the solver, other parallel visualization modules, and a graphics workstation running AVS is handled by CM/AVS. Partitioning of the visualization task between the CM-5 and the workstation can be done interactively in the visual programming environment provided by AVS. Flow solver parameters can also be altered by programmable interactive widgets. This system partially removes the requirement of storing large solution files at frequent time steps, a characteristic of the traditional 'simulate -> store -> visualize' post-processing approach.

  1. Efficient solid state NMR powder simulations using SMP and MPP parallel computation

    NASA Astrophysics Data System (ADS)

    Kristensen, Jørgen Holm; Farnan, Ian

    2003-04-01

    Methods for parallel simulation of solid state NMR powder spectra are presented for both shared and distributed memory parallel supercomputers. For shared memory architectures the performance of simulation programs implementing the OpenMP application programming interface is evaluated. It is demonstrated that the design of correct and efficient shared memory parallel programs is difficult as the performance depends on data locality and cache memory effects. The distributed memory parallel programming model is examined for simulation programs using the MPI message passing interface. The results reveal that both shared and distributed memory parallel computation are very efficient with an almost perfect application speedup and may be applied to the most advanced powder simulations.

  2. Hypergraph partitioning implementation for parallelizing matrix-vector multiplication using CUDA GPU-based parallel computing

    NASA Astrophysics Data System (ADS)

    Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.

    2017-07-01

    Calculation of the matrix-vector multiplication in real-world problems often involves large matrices of arbitrary size. Parallelization is therefore needed to speed up a calculation process that usually takes a long time. The graph partitioning techniques discussed in previous studies cannot be used to parallelize matrix-vector multiplication for matrices of arbitrary size, because graph partitioning assumes a square, symmetric matrix. Hypergraph partitioning techniques overcome this shortcoming of graph partitioning. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on its GPUs (graphics processing units).
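
    For comparison, the shape of the parallelized product y = A x can be sketched with a much simpler one-dimensional row partition; Python's process pool and scipy stand in for CUDA here, and a hypergraph partitioner would instead choose the partition so as to minimize communication between the parts:

        # Simple row-partitioned sparse matrix-vector product: each worker
        # computes the output entries for its strip of rows. (This 1-D strip
        # partition is only illustrative; it ignores the communication costs
        # a hypergraph partitioner is designed to minimize.)
        from concurrent.futures import ProcessPoolExecutor
        import numpy as np
        import scipy.sparse as sp

        def strip_matvec(args):
            rows_csr, x = args
            return rows_csr @ x

        if __name__ == "__main__":
            A = sp.random(1000, 800, density=0.01, format="csr", random_state=0)
            x = np.ones(800)
            strips = np.array_split(np.arange(1000), 4)
            with ProcessPoolExecutor(max_workers=4) as pool:
                parts = pool.map(strip_matvec, [(A[s, :], x) for s in strips])
            y = np.concatenate(list(parts))
            print(np.allclose(y, A @ x))  # True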

  3. Sublattice parallel replica dynamics

    NASA Astrophysics Data System (ADS)

    Martínez, Enrique; Uberuaga, Blas P.; Voter, Arthur F.

    2014-06-01

    Exascale computing presents a challenge for the scientific community as new algorithms must be developed to take full advantage of the new computing paradigm. Atomistic simulation methods that offer full fidelity to the underlying potential, i.e., molecular dynamics (MD) and parallel replica dynamics, fail to use the whole machine speedup, leaving a region in time and sample size space that is unattainable with current algorithms. In this paper, we present an extension of the parallel replica dynamics algorithm [A. F. Voter, Phys. Rev. B 57, R13985 (1998), 10.1103/PhysRevB.57.R13985] by combining it with the synchronous sublattice approach of Shim and Amar [Y. Shim and J. G. Amar, Phys. Rev. B 71, 125432 (2005), 10.1103/PhysRevB.71.125432], thereby exploiting event locality to improve the algorithm scalability. This algorithm is based on a domain decomposition in which events happen independently in different regions in the sample. We develop an analytical expression for the speedup given by this sublattice parallel replica dynamics algorithm and compare it with parallel MD and traditional parallel replica dynamics. We demonstrate how this algorithm, which introduces a slight additional approximation of event locality, enables the study of physical systems unreachable with traditional methodologies and promises to better utilize the resources of current high performance and future exascale computers.

  4. Parallel time integration software

    SciTech Connect

    2014-07-01

    This package implements an optimal-scaling multigrid solver for the (non)linear systems that arise from the discretization of problems with evolutionary behavior. Typically, solution algorithms for evolution equations are based on a time-marching approach, solving sequentially for one time step after the other. Parallelism in these traditional time-integration techniques is limited to spatial parallelism. However, current trends in computer architectures are leading towards systems with more, but not faster, processors. Therefore, faster compute speeds must come from greater parallelism. One approach to achieving parallelism in time is with multigrid, but extending classical multigrid methods for elliptic operators to this setting is a significant achievement. In this software, we implement a non-intrusive, optimal-scaling time-parallel method based on multigrid reduction techniques. The examples in the package demonstrate optimality of our multigrid-reduction-in-time algorithm (MGRIT) for solving a variety of parabolic equations in two and three spatial dimensions. These examples can also be used to show that MGRIT can achieve significant speedup in comparison to sequential time marching on modern architectures.
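
    As a flavor of parallel-in-time integration, the following toy Parareal iteration (a two-level relative of the multigrid-reduction approach, not the MGRIT package itself) solves y' = lam*y; the Euler propagators and all constants are assumptions for illustration:

        # Toy Parareal iteration for y' = lam * y on [0, T]: a cheap coarse
        # propagator G (one Euler step per time slice) is corrected by an
        # accurate fine propagator F (many Euler steps per slice). The F
        # evaluations in each sweep are independent across slices, which is
        # where the parallelism in time comes from (run serially here).
        import math

        lam, T, N, m = -1.0, 2.0, 8, 64    # N time slices, m fine steps each
        dt = T / N

        def G(y):                          # coarse: one Euler step
            return y * (1 + lam * dt)

        def F(y):                          # fine: m Euler steps
            h = dt / m
            for _ in range(m):
                y *= (1 + lam * h)
            return y

        U = [1.0] * (N + 1)
        for n in range(N):                 # initial coarse sweep
            U[n + 1] = G(U[n])
        for k in range(4):                 # Parareal corrections
            Fu = [F(U[n]) for n in range(N)]   # parallel across n
            Unew = [1.0] * (N + 1)
            for n in range(N):
                Unew[n + 1] = G(Unew[n]) + Fu[n] - G(U[n])
            U = Unew
        print(U[-1], math.exp(lam * T))    # converges toward exp(-2)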

  5. Parallel architectures for vision

    SciTech Connect

    Maresca, M. ); Lavin, M.A. ); Li, H. )

    1988-08-01

    Vision computing involves the execution of a large number of operations on large sets of structured data. Sequential computers cannot achieve the speed required by most of the current applications and therefore parallel architectural solutions have to be explored. In this paper the authors examine the options that drive the design of a vision oriented computer, starting with the analysis of the basic vision computation and communication requirements. They briefly review the classical taxonomy for parallel computers, based on the multiplicity of the instruction and data stream, and apply a recently proposed criterion, the degree of autonomy of each processor, to further classify fine-grain SIMD massively parallel computers. They identify three types of processor autonomy, namely operation autonomy, addressing autonomy, and connection autonomy. For each type they give the basic definitions and show some examples. They focus on the concept of connection autonomy, which they believe is a key point in the development of massively parallel architectures for vision. They show two examples of parallel computers featuring different types of connection autonomy - the Connection Machine and the Polymorphic-Torus - and compare their cost and benefit.

  6. Performance Evaluation in Network-Based Parallel Computing

    NASA Technical Reports Server (NTRS)

    Dezhgosha, Kamyar

    1996-01-01

    Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require the use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing network of Sun SPARCstations with PVM (Parallel Virtual Machine), a software system for linking clusters of machines. Second, a set of three basic applications was selected: a parallel search, a parallel sort, and a parallel matrix multiplication. These application programs were implemented in the C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time, which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes is in many cases the restricting factor to performance. That is, coarse-grain parallelism, which requires less frequent communication between processes, will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps), which will allow us to extend our study to newer applications, performance metrics, and configurations.

  8. Parallel optical sampler

    DOEpatents

    Tauke-Pedretti, Anna; Skogen, Erik J; Vawter, Gregory A

    2014-05-20

    An optical sampler includes first and second 1 x n optical beam splitters splitting an input optical sampling signal and an optical analog input signal into n parallel channels, respectively, a plurality of optical delay elements providing n parallel delayed input optical sampling signals, n photodiodes converting the n parallel optical analog input signals into n respective electrical output signals, and n optical modulators modulating the input optical sampling signal or the optical analog input signal by the respective electrical output signals, and providing n successive optical samples of the optical analog input signal. A plurality of output photodiodes and eADCs convert the n successive optical samples to n successive digital samples. The optical modulator may be a photodiode-interconnected Mach-Zehnder modulator. A method of sampling the optical analog input signal is disclosed.

  9. Collisionless parallel shocks

    SciTech Connect

    Khabibrakhmanov, I.K. ); Galeev, A.A.; Galinsky, V.L. )

    1993-02-01

    A collisionless parallel shock model is presented which is based on solitary-type solutions of the modified derivative nonlinear Schrödinger equation (MDNLS) for parallel Alfvén waves. We generalize the standard derivative nonlinear Schrödinger equation in order to include the possible anisotropy of the plasma distribution function and higher-order Korteweg-de Vries type dispersion. Stationary solutions of MDNLS are discussed. The new mechanism of ion reflection from the magnetic mirror of the parallel shock structure, which can be called 'adiabatic', is a natural and essential feature of the parallel shock that introduces irreversible properties into the nonlinear wave structure and may significantly contribute to the plasma heating upstream as well as downstream of the shock. The anisotropic nature of the 'adiabatic' reflections leads to an asymmetric particle distribution in the upstream as well as the downstream regions of the shock. As a result, a nonzero heat flux appears near the front of the shock. It is shown that this causes stochastic behavior of the nonlinear waves, which can significantly contribute to the shock thermalization. The number of adiabatically reflected ions defines the threshold conditions of the fire-hose and mirror-type instabilities in the downstream and upstream regions, and thus determines a parameter region in which the described laminar parallel shock structure can exist. 29 refs., 4 figs.

  10. Highly parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.; Tichy, Walter F.

    1990-01-01

    Among the highly parallel computing architectures required for advanced scientific computation, those designated 'MIMD' and 'SIMD' have yielded the best results to date. An evaluation of the present development status of such architectures shows that neither has attained a decisive advantage in the treatment of most near-homogeneous problems; for problems involving numerous dissimilar parts, however, such currently speculative architectures as 'neural networks' or 'data flow' machines may be required. Data flow computers are the most practical form of MIMD fine-grained parallel computers yet conceived; they automatically solve the problem of assigning virtual processors to the real processors in the machine.

  11. Ion parallel closures

    NASA Astrophysics Data System (ADS)

    Ji, Jeong-Young; Lee, Hankyu Q.; Held, Eric D.

    2017-02-01

    Ion parallel closures are obtained for arbitrary atomic weights and charge numbers. For arbitrary collisionality, the heat flow and viscosity are expressed as kernel-weighted integrals of the temperature and flow-velocity gradients. Simple, fitted kernel functions are obtained from the 1600 parallel moment solution and the asymptotic behavior in the collisionless limit. The fitted kernel parameters are tabulated for various temperature ratios of ions to electrons. The closures can be used conveniently without solving the kinetic equation or higher order moment equations in closing ion fluid equations.

  12. CRUNCH_PARALLEL

    SciTech Connect

    Shumaker, Dana E.; Steefel, Carl I.

    2016-06-21

    The code CRUNCH_PARALLEL is a parallel version of the CRUNCH code. CRUNCH code version 2.0 was previously released by LLNL (UCRL-CODE-200063). Crunch is a general purpose reactive transport code developed by Carl Steefel and Yabusake (Steefel and Yabusake, 1996). The code handles non-isothermal transport and reaction in one, two, and three dimensions. The reaction algorithm is generic in form, handling an arbitrary number of aqueous and surface complexation reactions as well as mineral dissolution/precipitation. A standardized database is used containing thermodynamic and kinetic data. The code includes advective, dispersive, and diffusive transport.

  13. Adaptive parallel logic networks

    NASA Technical Reports Server (NTRS)

    Martinez, Tony R.; Vidal, Jacques J.

    1988-01-01

    Adaptive, self-organizing concurrent systems (ASOCS) that combine self-organization with massive parallelism for such applications as adaptive logic devices, robotics, process control, and system malfunction management, are presently discussed. In ASOCS, an adaptive network composed of many simple computing elements operating in combinational and asynchronous fashion is used and problems are specified by presenting if-then rules to the system in the form of Boolean conjunctions. During data processing, which is a different operational phase from adaptation, the network acts as a parallel hardware circuit.

  14. Design strategies for irregularly adapting parallel applications

    SciTech Connect

    Oliker, Leonid; Biswas, Rupak; Shan, Hongzhang; Singh, Jaswinder Pal

    2000-11-01

    Achieving scalable performance for dynamic irregular applications is eminently challenging. Traditional message-passing approaches have been making steady progress towards this goal; however, they suffer from complex implementation requirements. The use of a global address space greatly simplifies the programming task, but can degrade the performance of dynamically adapting computations. In this work, we examine two major classes of adaptive applications, under five competing programming methodologies and four leading parallel architectures. Results indicate that it is possible to achieve message-passing performance using shared-memory programming techniques by carefully following the same high level strategies. Adaptive applications have computational work loads and communication patterns which change unpredictably at runtime, requiring dynamic load balancing to achieve scalable performance on parallel machines. Efficient parallel implementations of such adaptive applications are therefore a challenging task. This work examines the implementation of two typical adaptive applications, Dynamic Remeshing and N-Body, across various programming paradigms and architectural platforms. We compare several critical factors of the parallel code development, including performance, programmability, scalability, algorithmic development, and portability.

  15. Parallel Education and Defining the Fourth Sector.

    ERIC Educational Resources Information Center

    Chessell, Diana

    1996-01-01

    Parallel to the primary, secondary, postsecondary, and adult/community education sectors is education not associated with formal programs--learning in arts and cultural sites. The emergence of cultural and educational tourism is an opportunity for adult/community education to define itself by extending lifelong learning opportunities into parallel…

  16. Parallel versions of FORM and more

    NASA Astrophysics Data System (ADS)

    Steinhauser, Matthias; Ueda, Takahiro; Vermaseren, Jos A. M.

    2015-04-01

    We review the status of the parallel versions of the computer algebra system FORM. In particular, we provide a brief overview about the historical developments, discuss the strengths of ParFORM and TFORM, and mention typical applications. Furthermore, we briefly discuss the programs FIRE and FIESTA, which have also been developed with the Collaborative Research Center/TR 9 (CRC/TR 9).

  17. A parallel, portable and versatile treecode

    SciTech Connect

    Warren, M.S.; Salmon, J.K. |

    1994-10-01

    Portability and versatility are important characteristics of a computer program which is meant to be generally useful. We describe how we have developed a parallel N-body treecode to meet these goals. A variety of applications to which the code can be applied are mentioned. Performance of the program is also measured on several machines. A 512 processor Intel Paragon can solve for the forces on 10 million gravitationally interacting particles to 0.5% rms accuracy in 28.6 seconds.

  18. The Parallel of Decomposition of Linear Programs

    DTIC Science & Technology

    1989-11-01

    (Fragment of computational results tables for the SCTAP1 linear programming test problem: iteration counts, timings, and the optimal objective value 0.141225e+04.)

  19. Debugging Parallel Programs with Instant Replay.

    DTIC Science & Technology

    1986-09-01

    collected during a single execution. Behavioral Abstraction (BA) is typical of this approach [2]. BA provides a mechanism for the hierarchical definition ... exclusive access to the object. This description of a CREW access protocol is intended to be illustrative, not definitive. Instant Replay requires ... slightly beyond 32 processors because the additional communication involved is not justified by the gain in parallelism [14]. A naive placement of

  20. Semantic Language Extensions for Implicit Parallel Programming

    DTIC Science & Technology

    2013-09-01

    hugely popular amongst astrophysicists. BUGS [164], a DSL for describing graphical models, is almost the de facto standard amongst researchers ... mechanisms to completely share every clause learnt amongst the workers can lead to increased overheads, while having private copies with no sharing can ... insertion of the same element. 2. Complete Sharing [40, 128, 179]: A weakly consistent set is realized by a single data structure that is shared amongst

  1. A Programming Environment for Parallel Vision Algorithms.

    DTIC Science & Technology

    1987-02-01

    applications, a combination of SMP and Modula-2 could replace the use of C and Chrysalis system calls on the Butterfly. Our SMP and Modula-2 implementations have been in use on the Butterfly for 6 months. SMP is now being used within BBN and has been requested by several other Butterfly sites. Both ... the daemon to create a new shell (on the Butterfly) which in turn allocates an appropriate number of Butterfly nodes (see the section on Software

  2. Parallel molecular dynamics: Communication requirements for massively parallel machines

    NASA Astrophysics Data System (ADS)

    Taylor, Valerie E.; Stevens, Rick L.; Arnold, Kathryn E.

    1995-05-01

    Molecular mechanics and dynamics are becoming widely used to perform simulations of molecular systems from large-scale computations of materials to the design and modeling of drug compounds. In this paper we address two major issues: a good decomposition method that can take advantage of future massively parallel processing systems for modest-sized problems in the range of 50,000 atoms and the communication requirements needed to achieve 30 to 40% efficiency on MPPs. We analyzed a scalable benchmark molecular dynamics program executing on the Intel Touchstone Delta, parallelized with an interaction decomposition method. Using a validated analytical performance model of the code, we determined that for an MPP with a four-dimensional mesh topology and 400 MHz processors the communication startup time must be at most 30 clock cycles and the network bandwidth must be at least 2.3 GB/s. This configuration results in 30 to 40% efficiency of the MPP for a problem with 50,000 atoms executing on 50,000 processors.
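
    The two thresholds quoted - startup of at most 30 cycles at 400 MHz and bandwidth of at least 2.3 GB/s - plug directly into the usual startup-plus-transfer cost model for one message; the 8 KB message size below is an invented example:

        # Latency/bandwidth ("alpha-beta") estimate of one message's cost
        # using the thresholds quoted above; the 8 KB message is illustrative.
        cycles, clock_hz = 30, 400e6
        alpha = cycles / clock_hz    # startup time: 75 ns
        beta = 2.3e9                 # bandwidth: 2.3 GB/s
        msg_bytes = 8 * 1024

        t = alpha + msg_bytes / beta
        print(f"startup {alpha * 1e9:.0f} ns + transfer "
              f"{msg_bytes / beta * 1e6:.2f} us = {t * 1e6:.2f} us per message")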

  3. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (Inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

  4. Parallel Total Energy

    SciTech Connect

    Wang, Lin-Wang

    2004-10-21

    This is a total energy electronic structure code using the Local Density Approximation (LDA) of density functional theory. It uses plane waves as the wave function basis set. It can use both norm-conserving pseudopotentials and ultrasoft pseudopotentials. It can relax the atomic positions according to the total energy. It is a parallel code using MPI.

  5. Parallel hierarchical radiosity rendering

    SciTech Connect

    Carter, Michael

    1993-07-01

    In this dissertation, the step-by-step development of a scalable parallel hierarchical radiosity renderer is documented. First, a new look is taken at the traditional radiosity equation, and a new form is presented in which the matrix of linear system coefficients is transformed into a symmetric matrix, thereby simplifying the problem and enabling a new solution technique to be applied. Next, the state-of-the-art hierarchical radiosity methods are examined for their suitability to parallel implementation, and scalability. Significant enhancements are also discovered which both improve their theoretical foundations and improve the images they generate. The resultant hierarchical radiosity algorithm is then examined for sources of parallelism, and for an architectural mapping. Several architectural mappings are discussed. A few key algorithmic changes are suggested during the process of making the algorithm parallel. Next, the performance, efficiency, and scalability of the algorithm are analyzed. The dissertation closes with a discussion of several ideas which have the potential to further enhance the hierarchical radiosity method, or provide an entirely new forum for the application of hierarchical methods.

  6. [The parallel saw blade].

    PubMed

    Mühldorfer-Fodor, M; Hohendorff, B; Prommersberger, K-J; van Schoonhoven, J

    2011-04-01

    For a shortening osteotomy, two exactly parallel osteotomies are needed to assure congruent adaptation of the shortened bone after segment resection. This is required for regular bone healing. In addition, it is difficult to shorten a bone by a precise distance using an oblique segment resection. A mobile spacer between the two saw blades keeps the blades exactly parallel at a fixed distance during the osteotomy cut. The parallel saw blades from Synthes® are designed for 2, 2.5, 3, 4, and 5 mm shortening distances. Two types of blades are available (e.g., for transverse or oblique osteotomies) to assure precise shortening. Preoperatively, the desired type of osteotomy (transverse or oblique) and the shortening distance have to be determined. Then the appropriate parallel saw blade is chosen, which is compatible with the Synthes® Colibri with an oscillating saw attachment. During the osteotomy cut, the spacer should be kept as close to the bone as possible. Excessive force that may deform the blades should be avoided. Before manipulating the bone ends, it is important to confirm that the bone has been completely cut through by both saw blades, to prevent fracturing of the cortex with bony spurs. The shortening osteotomy is mainly fixed by plate osteosynthesis. For compression of the bone ends, the screws should be placed eccentrically in the plate holes. For an oblique osteotomy, an additional lag screw should be used.

  7. Parallel Coordinate Axes.

    ERIC Educational Resources Information Center

    Friedlander, Alex; And Others

    1982-01-01

    Several methods of numerical mappings other than the usual cartesian coordinate system are considered. Some examples using parallel axes representation, which are seen to lead to aesthetically pleasing or interesting configurations, are presented. Exercises with alternative representations can stimulate pupil imagination and exploration in…

  8. Parallel Dislocation Simulator

    SciTech Connect

    2006-10-30

    ParaDiS is software capable of simulating the motion, evolution, and interaction of dislocation networks in single crystals using massively parallel computer architectures. The software is capable of outputting the stress-strain response of a single crystal whose plastic deformation is controlled by the dislocation processes.

  9. Parallel fast gauss transform

    SciTech Connect

    Sampath, Rahul S; Sundar, Hari; Veerapaneni, Shravan

    2010-01-01

    We present fast adaptive parallel algorithms to compute the sum of N Gaussians at N points. Direct sequential computation of this sum would take O(N^2) time. The parallel time complexity estimates for our algorithms are O(N/n_p) for uniform point distributions and O((N/n_p) log(N/n_p) + n_p log n_p) for non-uniform distributions using n_p CPUs. We incorporate a plane-wave representation of the Gaussian kernel which permits 'diagonal translation'. We use parallel octrees and a new scheme for translating the plane-waves to efficiently handle non-uniform distributions. Computing the transform to six-digit accuracy at 120 billion points took approximately 140 seconds using 4096 cores on the Jaguar supercomputer. Our implementation is 'kernel-independent' and can handle other 'Gaussian-type' kernels even when an explicit analytic expression for the kernel is not known. These algorithms form a new class of core computational machinery for solving parabolic PDEs on massively parallel architectures.
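
    The direct O(N^2) sum that such algorithms accelerate is simple to state; the numpy version below serves only as the brute-force reference computation (points, weights, and the bandwidth h are invented):

        # Direct O(N^2) sum of Gaussians:
        #   f(y_i) = sum_j q_j * exp(-|y_i - x_j|^2 / h^2)
        # This is the brute-force reference that fast Gauss transforms
        # replace with an O(N)-type algorithm.
        import numpy as np

        rng = np.random.default_rng(1)
        x = rng.random((1000, 3))    # source points
        y = rng.random((1000, 3))    # target points
        q = rng.random(1000)         # source weights
        h = 0.25                     # Gaussian bandwidth

        d2 = ((y[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
        f = (np.exp(-d2 / h**2) * q[None, :]).sum(axis=1)
        print(f.shape, f[:3])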

  10. Parallel hierarchical global illumination

    SciTech Connect

    Snell, Quinn O.

    1997-10-08

    Solving the global illumination problem is equivalent to determining the intensity of every wavelength of light in all directions at every point in a given scene. The complexity of the problem has led researchers to use approximation methods for solving the problem on serial computers. Rather than using an approximation method, such as backward ray tracing or radiosity, the authors have chosen to solve the Rendering Equation by direct simulation of light transport from the light sources. This paper presents an algorithm that solves the Rendering Equation to any desired accuracy, and can be run in parallel on distributed memory or shared memory computer systems with excellent scaling properties. It appears superior in both speed and physical correctness to recent published methods involving bidirectional ray tracing or hybrid treatments of diffuse and specular surfaces. Like progressive radiosity methods, it dynamically refines the geometry decomposition where required, but does so without the excessive storage requirements for ray histories. The algorithm, called Photon, produces a scene which converges to the global illumination solution. This amounts to a huge task for a 1997-vintage serial computer, but using the power of a parallel supercomputer significantly reduces the time required to generate a solution. Currently, Photon can be run on most parallel environments from a shared memory multiprocessor to a parallel supercomputer, as well as on clusters of heterogeneous workstations.

  11. Parallel Multigrid Equation Solver

    SciTech Connect

    Adams, Mark

    2001-09-07

    Prometheus is a fully parallel multigrid equation solver for matrices that arise in unstructured grid finite element applications. It includes a geometric and an algebraic multigrid method and has solved problems of up to 76 million degrees of freedom, problems in linear elasticity on the ASCI Blue Pacific and ASCI Red machines.

  12. High Performance Parallel Computational Nanotechnology

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Craw, James M. (Technical Monitor)

    1995-01-01

    At a recent press conference, NASA Administrator Dan Goldin encouraged NASA Ames Research Center to take a lead role in promoting research and development of advanced, high-performance computer technology, including nanotechnology. Manufacturers of leading-edge microprocessors currently perform large-scale simulations in the design and verification of semiconductor devices and microprocessors. Recently, the need for this intensive simulation and modeling analysis has greatly increased, due in part to the ever-increasing complexity of these devices, as well as the lessons of experiences such as the Pentium fiasco. Simulation, modeling, testing, and validation will be even more important for designing molecular computers because of the complex specification of millions of atoms, thousands of assembly steps, as well as the simulation and modeling needed to ensure reliable, robust and efficient fabrication of the molecular devices. The software for this capacity does not exist today, but it can be extrapolated from the software currently used in molecular modeling for other applications: semi-empirical methods, ab initio methods, self-consistent field methods, Hartree-Fock methods, molecular mechanics; and simulation methods for diamondoid structures. Inasmuch as it seems clear that the application of such methods in nanotechnology will require powerful, highly parallel systems, this talk will discuss techniques and issues for performing these types of computations on parallel systems. We will describe system design issues (memory, I/O, mass storage, operating system requirements, special user interface issues, interconnects, bandwidths, and programming languages) involved in parallel methods for scalable classical, semiclassical, quantum, molecular mechanics, and continuum models; molecular nanotechnology computer-aided designs (NanoCAD) techniques; visualization using virtual reality techniques of structural models and assembly sequences; software required to

  14. A note on parallel efficiency of fire simulation on cluster

    NASA Astrophysics Data System (ADS)

    Valasek, L.; Glasa, J.

    2016-08-01

    Current HPC clusters are capable of reducing the execution time of parallelized tasks significantly. This paper discusses two selected strategies for allocating cluster computational resources and their impact on the parallel efficiency of fire simulation. Simulation of a simple corridor fire scenario by the Fire Dynamics Simulator, parallelized using the MPI programming model, is tested on the HPC cluster at the Institute of Informatics of the Slovak Academy of Sciences in Bratislava (Slovakia). The tests confirm that parallelization has great potential to reduce execution times, achieving promising values of parallel efficiency for the simulation; however, the results also show that using more computational meshes, and hence more computational cores, does not necessarily decrease the execution time or the parallel efficiency of the simulation. The results obtained indicate that the execution time and the parallel efficiency achieved by the simulation depend on the strategy used for allocating cluster computational resources.
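
    Parallel efficiency here is the usual speedup-per-core ratio E = T1/(p*Tp); a small calculation with invented timings shows how efficiency can fall even while wall-clock time still improves:

        # Parallel efficiency E = T1 / (p * Tp): invented execution times for
        # one fire scenario run on increasing core counts, showing that
        # efficiency can drop even as wall-clock time (slowly) improves.
        t1 = 3600.0                  # serial execution time (s)
        timings = {1: 3600.0, 2: 1900.0, 4: 1050.0, 8: 640.0, 16: 520.0}

        for p, tp in timings.items():
            speedup = t1 / tp
            eff = speedup / p
            print(f"{p:2d} cores: T={tp:6.0f} s  speedup={speedup:4.2f}  "
                  f"efficiency={eff:.2f}")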

  15. An Expert System for the Development of Efficient Parallel Code

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.

  16. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, T.

    1986-01-01

    A nonlinear structural dynamics program with an element library that exploits parallel processing is under development. The aim is to exploit scheduling-allocation so that parallel processing and vectorization can effectively be treated in a general purpose program. As a byproduct, an automatic scheme for assigning time steps was devised. A rudimentary form of the program is complete and has been tested; it shows that substantial advantage can be taken of parallelism. In addition, a stability proof for the subcycling algorithm has been developed.

  17. Experiences Running a Parallel Answer Set Solver on Blue Gene

    NASA Astrophysics Data System (ADS)

    Schneidenbach, Lars; Schnor, Bettina; Gebser, Martin; Kaminski, Roland; Kaufmann, Benjamin; Schaub, Torsten

    This paper presents the concept of parallelisation of a solver for Answer Set Programming (ASP). While there already exist some approaches to parallel ASP solving, there was a lack of a parallel version of the powerful clasp solver. We implemented a parallel version of clasp based on message-passing. Experimental results on Blue Gene P/L indicate the potential of such an approach.

  18. A Parallel Algorithm for the Vehicle Routing Problem

    SciTech Connect

    Groer, Christopher S; Golden, Bruce; Edward, Wasil

    2011-01-01

    The vehicle routing problem (VRP) is a difficult and well-studied combinatorial optimization problem. We develop a parallel algorithm for the VRP that combines a heuristic local search improvement procedure with integer programming. We run our parallel algorithm with as many as 129 processors and are able to quickly find high-quality solutions to standard benchmark problems. We assess the impact of parallelism by analyzing our procedure's performance under a number of different scenarios.

  19. Impact of a nurse-directed, coordinated school health program to enhance physical activity behaviors and reduce body mass index among minority children: a parallel-group, randomized control trial.

    PubMed

    Wright, Kynna; Giger, Joyce Newman; Norris, Keith; Suro, Zulma

    2013-06-01

    Underserved children, particularly girls and those in urban communities, do not meet the recommended physical activity guidelines (>60min of daily physical activity), and this behavior can lead to obesity. The school years are known to be a critical period in the life course for shaping attitudes and behaviors. Children look to schools for much of their access to physical activity. Thus, through the provision of appropriate physical activity programs, schools have the power to influence apt physical activity choices, especially for underserved children where disparities in obesity-related outcomes exist. To evaluate the impact of a nurse directed, coordinated, culturally sensitive, school-based, family-centered lifestyle program on activity behaviors and body mass index. This was a parallel group, randomized controlled trial utilizing a community-based participatory research approach, through a partnership with a University and 5 community schools. Participants included 251 children ages 8-12 from elementary schools in urban, low-income neighborhoods in Los Angeles, USA. The intervention included Kids N Fitness(©), a 6-week program which met weekly to provide 45min of structured physical activity and a 45min nutrition education class for parents and children. Intervention sites also participated in school-wide wellness activities, including health and counseling services, staff professional development in health promotion, parental education newsletters, and wellness policies for the provision of healthy foods at the school. The Child and Adolescent Trial for Cardiovascular Health School Physical Activity and Nutrition Student Questionnaire measured physical activity behavior, including: daily physical activity, participation in team sports, attending physical education class, and TV viewing/computer game playing. Anthropometric measures included height, weight, body mass index, resting blood pressure, and waist circumference. Measures were collected at baseline

  20. Impact of a nurse-directed, coordinated school health program to enhance physical activity behaviors and reduce body mass index among minority children: A parallel-group, randomized control trial

    PubMed Central

    Wright, Kynna; Giger, Joyce Newman; Norris, Keith; Suro, Zulma

    2013-01-01

    Background Underserved children, particularly girls and those in urban communities, do not meet the recommended physical activity guidelines (>60 min of daily physical activity), and this behavior can lead to obesity. The school years are known to be a critical period in the life course for shaping attitudes and behaviors. Children look to schools for much of their access to physical activity. Thus, through the provision of appropriate physical activity programs, schools have the power to influence apt physical activity choices, especially for underserved children where disparities in obesity-related outcomes exist. Objectives To evaluate the impact of a nurse directed, coordinated, culturally sensitive, school-based, family-centered lifestyle program on activity behaviors and body mass index. Design, settings and participants: This was a parallel group, randomized controlled trial utilizing a community-based participatory research approach, through a partnership with a University and 5 community schools. Participants included 251 children ages 8–12 from elementary schools in urban, low-income neighborhoods in Los Angeles, USA. Methods The intervention included Kids N Fitness©, a 6-week program which met weekly to provide 45 min of structured physical activity and a 45 min nutrition education class for parents and children. Intervention sites also participated in school-wide wellness activities, including health and counseling services, staff professional development in health promotion, parental education newsletters, and wellness policies for the provision of healthy foods at the school. The Child and Adolescent Trial for Cardiovascular Health School Physical Activity and Nutrition Student Questionnaire measured physical activity behavior, including: daily physical activity, participation in team sports, attending physical education class, and TV viewing/computer game playing. Anthropometric measures included height, weight, body mass index, resting blood pressure, and waist circumference.

  1. Distributed and parallel Ada and the Ada 9X recommendations

    NASA Technical Reports Server (NTRS)

    Volz, Richard A.; Goldsack, Stephen J.; Theriault, R.; Waldrop, Raymond S.; Holzbacher-Valero, A. A.

    1992-01-01

    Recently, the DoD has sponsored work towards a new version of Ada, intended to support the construction of distributed systems. The revised version, often called Ada 9X, will become the new standard sometime in the 1990s. It is intended that Ada 9X should provide language features giving limited support for distributed system construction. The requirements for such features are given. Many of the most advanced computer applications involve embedded systems that are comprised of parallel processors or networks of distributed computers. If Ada is to become the widely adopted language envisioned by many, it is essential that suitable compilers and tools be available to facilitate the creation of distributed and parallel Ada programs for these applications. The major language issues impacting distributed and parallel programming are reviewed, and some principles upon which distributed/parallel language systems should be built are suggested. Based upon these, alternative language concepts for distributed/parallel programming are analyzed.

  2. Distributed and parallel Ada and the Ada 9X recommendations

    SciTech Connect

    Volz, R.A.; Goldsack, S.J.; Theriault, R.; Waldrop, R.S.; Holzbacher-Valero, A.A.

    1992-04-01

    Recently, the DoD has sponsored work towards a new version of Ada, intended to support the construction of distributed systems. The revised version, often called Ada 9X, will become the new standard sometime in the 1990s. It is intended that Ada 9X should provide language features giving limited support for distributed system construction. The requirements for such features are given. Many of the most advanced computer applications involve embedded systems that are comprised of parallel processors or networks of distributed computers. If Ada is to become the widely adopted language envisioned by many, it is essential that suitable compilers and tools be available to facilitate the creation of distributed and parallel Ada programs for these applications. The major language issues impacting distributed and parallel programming are reviewed, and some principles upon which distributed/parallel language systems should be built are suggested. Based upon these, alternative language concepts for distributed/parallel programming are analyzed.

  3. Parallel computing using a Lagrangian formulation

    NASA Technical Reports Server (NTRS)

    Liou, May-Fun; Loh, Ching Yuen

    1991-01-01

    A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on an 8192-processor CM-2 over a single processor of a CRAY-2.

  4. Hypercluster parallel processing library user's manual

    NASA Technical Reports Server (NTRS)

    Quealy, Angela

    1990-01-01

    This User's Manual describes the Hypercluster Parallel Processing Library, composed of FORTRAN-callable subroutines which enable a FORTRAN programmer to manipulate and transfer information throughout the Hypercluster at NASA Lewis Research Center. Each subroutine and its parameters are described in detail. A simple heat flow application using Laplace's equation is included to demonstrate the use of some of the library's subroutines. The manual can be used initially as an introduction to the parallel features provided by the library. Thereafter it can be used as a reference when programming an application.

  5. Compositional reservoir simulation in parallel supercomputing environments

    SciTech Connect

    Briens, F.J.L.; Wu, C.H.; Gazdag, J.; Wang, H.H.

    1991-09-01

    A large-scale compositional reservoir simulation (>1,000 cells) is not often run on a conventional mainframe computer owing to excessive turnaround times. This paper presents programming and computational techniques that fully exploit the capabilities of parallel supercomputers for a large-scale compositional simulation. A novel algorithm called sequential staging of tasks (SST) that can take full advantage of parallel-vector processing to speed up the solution of a large linear system is introduced. The effectiveness of SST is illustrated with results from computer experiments conducted on an IBM 3090-600E.

  6. A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture

    NASA Technical Reports Server (NTRS)

    Imbriale, W. A.; Cwik, T.

    1994-01-01

    A reflector antenna computer program based upon a simple discrete approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-grain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.

  7. Parallel multilevel preconditioners

    SciTech Connect

    Bramble, J.H.; Pasciak, J.E.; Xu, Jinchao.

    1989-01-01

    In this paper, we shall report on some techniques for the development of preconditioners for the discrete systems which arise in the approximation of solutions to elliptic boundary value problems. Here we shall only state the resulting theorems. It has been demonstrated that preconditioned iteration techniques often lead to the most computationally effective algorithms for the solution of the large algebraic systems corresponding to boundary value problems in two and three dimensional Euclidean space. The use of preconditioned iteration will become even more important on computers with parallel architecture. This paper discusses an approach for developing completely parallel multilevel preconditioners. In order to illustrate the resulting algorithms, we shall describe the simplest application of the technique to a model elliptic problem.

  8. Homology, convergence and parallelism.

    PubMed

    Ghiselin, Michael T

    2016-01-05

    Homology is a relation of correspondence between parts of parts of larger wholes. It is used when tracking objects of interest through space and time and in the context of explanatory historical narratives. Homologues can be traced through a genealogical nexus back to a common ancestral precursor. Homology being a transitive relation, homologues remain homologous however much they may come to differ. Analogy is a relationship of correspondence between parts of members of classes having no relationship of common ancestry. Although homology is often treated as an alternative to convergence, the latter is not a kind of correspondence: rather, it is one of a class of processes that also includes divergence and parallelism. These often give rise to misleading appearances (homoplasies). Parallelism can be particularly hard to detect, especially when not accompanied by divergences in some parts of the body. © 2015 The Author(s).

  9. Ultrascalable petaflop parallel supercomputer

    DOEpatents

    Blumrich, Matthias A.; Chen, Dong; Chiu, George; Cipolla, Thomas M.; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Hall, Shawn; Haring, Rudolf A.; Heidelberger, Philip; Kopcsay, Gerard V.; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan; Takken, Todd

    2010-07-20

    A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.

  10. Parallel Anisotropic Tetrahedral Adaptation

    NASA Technical Reports Server (NTRS)

    Park, Michael A.; Darmofal, David L.

    2008-01-01

    An adaptive method that robustly produces high aspect ratio tetrahedra to a general 3D metric specification without introducing hybrid semi-structured regions is presented. The elemental operators and higher-level logic are described with their respective domain-decomposed parallelizations. An anisotropic tetrahedral grid adaptation scheme is demonstrated for 1000:1 stretching for a simple cube geometry. This form of adaptation is applicable to more complex domain boundaries via a cut-cell approach as demonstrated by a parallel 3D supersonic simulation of a complex fighter aircraft. To avoid the assumptions and approximations required to form a metric to specify adaptation, an approach is introduced that directly evaluates interpolation error. The grid is adapted to reduce and equidistribute this interpolation error without the use of an intervening anisotropic metric. Direct interpolation error adaptation is illustrated for 1D and 3D domains.

  11. Parallel grid population

    DOEpatents

    Wald, Ingo; Ize, Santiago

    2015-07-28

    Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.
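
    A minimal OpenMP sketch of this two-phase scheme might look as follows (an illustration only, not the patented implementation): phase 1 determines in parallel which portion bounds each object, and phase 2 gives each thread one distinct grid portion to populate. The 1-D coordinates and sizes are assumptions.

      /* Two-phase parallel grid population, sketched with OpenMP. */
      #include <omp.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define NOBJ 1000000

      int main(void) {
          int nport;                               /* one grid portion per thread */
          #pragma omp parallel
          #pragma omp single
          nport = omp_get_num_threads();

          float *pos = malloc(NOBJ * sizeof *pos); /* object coordinates in [0,1) */
          int *portion = malloc(NOBJ * sizeof *portion);
          for (int i = 0; i < NOBJ; i++)
              pos[i] = (rand() % 1000) / 1000.0f;

          /* Phase 1: each thread takes a distinct set of objects and records
           * which grid portion bounds each object in its set. */
          #pragma omp parallel for schedule(static)
          for (int i = 0; i < NOBJ; i++)
              portion[i] = (int)(pos[i] * nport);

          /* Phase 2: each thread owns one distinct grid portion and gathers
           * the objects previously determined to fall inside it. */
          #pragma omp parallel num_threads(nport)
          {
              int p = omp_get_thread_num();
              int *mine = malloc(NOBJ * sizeof *mine), count = 0;
              for (int i = 0; i < NOBJ; i++)
                  if (portion[i] == p) mine[count++] = i;
              printf("portion %d holds %d objects\n", p, count);
              free(mine);
          }
          free(pos); free(portion);
          return 0;
      }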

  12. Homology, convergence and parallelism

    PubMed Central

    Ghiselin, Michael T.

    2016-01-01

    Homology is a relation of correspondence between parts of parts of larger wholes. It is used when tracking objects of interest through space and time and in the context of explanatory historical narratives. Homologues can be traced through a genealogical nexus back to a common ancestral precursor. Homology being a transitive relation, homologues remain homologous however much they may come to differ. Analogy is a relationship of correspondence between parts of members of classes having no relationship of common ancestry. Although homology is often treated as an alternative to convergence, the latter is not a kind of correspondence: rather, it is one of a class of processes that also includes divergence and parallelism. These often give rise to misleading appearances (homoplasies). Parallelism can be particularly hard to detect, especially when not accompanied by divergences in some parts of the body. PMID:26598721

  13. Parallel Subconvolution Filtering Architectures

    NASA Technical Reports Server (NTRS)

    Gray, Andrew A.

    2003-01-01

    These architectures are based on methods of vector processing and the discrete-Fourier-transform/inverse-discrete-Fourier-transform (DFT-IDFT) overlap-and-save method, combined with time-block separation of digital filters into frequency-domain subfilters implemented by use of sub-convolutions. The parallel-processing method implemented in these architectures enables the use of relatively small DFT-IDFT pairs, while filter tap lengths are theoretically unlimited. The size of a DFT-IDFT pair is determined by the desired reduction in processing rate, rather than by the order of the filter that one seeks to implement. The emphasis in this report is on those aspects of the underlying theory and design rules that promote computational efficiency, parallel processing at reduced data rates, and simplification of the designs of very-large-scale integrated (VLSI) circuits needed to implement high-order filters and correlators.
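
    To make the overlap-and-save idea concrete, the sketch below filters a signal block-by-block in the frequency domain, using a naive O(N^2) DFT so the example stays self-contained; a real implementation, like the architectures above, would use an FFT and much larger sizes. The block sizes and the moving-average filter are illustrative assumptions.

      /* Overlap-save frequency-domain filtering (illustrative sizes). */
      #include <complex.h>
      #include <stdio.h>

      #define NFFT 16                 /* DFT-IDFT pair size */
      #define TAPS 5                  /* filter tap length  */
      #define STEP (NFFT - TAPS + 1)  /* new samples consumed per block */

      static const double PI = 3.14159265358979323846;

      static void dft(const double complex *in, double complex *out, int inverse) {
          for (int k = 0; k < NFFT; k++) {
              double complex acc = 0;
              for (int n = 0; n < NFFT; n++)
                  acc += in[n] * cexp((inverse ? I : -I) * 2 * PI * k * n / NFFT);
              out[k] = inverse ? acc / NFFT : acc;
          }
      }

      int main(void) {
          double h[TAPS] = {0.2, 0.2, 0.2, 0.2, 0.2};  /* moving-average subfilter */
          double x[4 * STEP];                          /* input samples */
          for (int i = 0; i < 4 * STEP; i++) x[i] = (i % 7 == 0) ? 1.0 : 0.0;

          double complex hpad[NFFT] = {0}, H[NFFT];    /* transform filter once */
          for (int i = 0; i < TAPS; i++) hpad[i] = h[i];
          dft(hpad, H, 0);

          double overlap[TAPS - 1] = {0};              /* saved tail of last block */
          for (int b = 0; b < 4; b++) {
              double complex in[NFFT], X[NFFT], y[NFFT];
              for (int i = 0; i < TAPS - 1; i++) in[i] = overlap[i];
              for (int i = 0; i < STEP; i++) in[TAPS - 1 + i] = x[b * STEP + i];
              dft(in, X, 0);
              for (int k = 0; k < NFFT; k++) X[k] *= H[k];  /* filter in frequency */
              dft(X, y, 1);
              for (int i = TAPS - 1; i < NFFT; i++)    /* discard wrapped outputs */
                  printf("y[%2d] = %6.3f\n", b * STEP + i - (TAPS - 1), creal(y[i]));
              for (int i = 0; i < TAPS - 1; i++)       /* save tail for next block */
                  overlap[i] = x[b * STEP + STEP - (TAPS - 1) + i];
          }
          return 0;
      }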

  14. Xyce parallel electronic simulator.

    SciTech Connect

    Keiter, Eric R; Mei, Ting; Russo, Thomas V.; Rankin, Eric Lamont; Schiek, Richard Louis; Thornquist, Heidi K.; Fixel, Deborah A.; Coffey, Todd S; Pawlowski, Roger P; Santarelli, Keith R.

    2010-05-01

    This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users Guide. The focus of this document is to exhaustively list, to the extent possible, device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users Guide.

  15. Parallel Computing in Optimization.

    DTIC Science & Technology

    1984-10-01

    No abstract is available for this record; the indexed text consists of bibliography fragments, citing surveys of parallel algorithms by Heller [1978] and Sameh [1977] and related work by Duff [1983], Fong and Jordan [1977], Jordan [1979], and Rodrigue [1982], among others.

  16. Parallel unstructured grid generation

    NASA Technical Reports Server (NTRS)

    Loehner, Rainald; Camberos, Jose; Merriam, Marshal

    1991-01-01

    A parallel unstructured grid generation algorithm is presented and implemented on the Hypercube. Different processor hierarchies are discussed, and the appropriate hierarchies for mesh generation and mesh smoothing are selected. A domain-splitting algorithm for unstructured grids which tries to minimize the surface-to-volume ratio of each subdomain is described. This splitting algorithm is employed both for grid generation and grid smoothing. Results obtained on the Hypercube demonstrate the effectiveness of the algorithms developed.

  17. Implementation of Parallel Algorithms

    DTIC Science & Technology

    1993-06-30

    their socia ’ relations or to achieve some goals. For example, we define a pair-wise force law of i epulsion and attraction for a group of identical...quantization based compression schemes. Photo-refractive crystals, which provide high density recording in real time, are used as our holographic media . The...of Parallel Algorithms (J. Reif, ed.). Kluwer Academic Pu’ ishers, 1993. (4) "A Dynamic Separator Algorithm", D. Armon and J. Reif. To appear in

  18. Parallel sphere rendering

    SciTech Connect

    Krogh, M.; Painter, J.; Hansen, C.

    1996-10-01

    Sphere rendering is an important method for visualizing molecular dynamics data. This paper presents a parallel algorithm that is almost 90 times faster than current graphics workstations. To render extremely large data sets and large images, the algorithm uses the MIMD features of the supercomputers to divide up the data, render independent partial images, and then finally composite the multiple partial images using an optimal method. The algorithm and performance results are presented for the CM-5 and the T3D.

  19. The 2nd Symposium on the Frontiers of Massively Parallel Computations

    NASA Technical Reports Server (NTRS)

    Mills, Ronnie (Editor)

    1988-01-01

    Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.

  1. Requirements for supercomputing in energy research: The transition to massively parallel computing

    SciTech Connect

    Not Available

    1993-02-01

    This report discusses: The emergence of a practical path to TeraFlop computing and beyond; requirements of energy research programs at DOE; implementation: supercomputer production computing environment on massively parallel computers; and implementation: user transition to massively parallel computing.

  2. Fast Inverse Distance Weighting-Based Spatiotemporal Interpolation: A Web-Based Application of Interpolating Daily Fine Particulate Matter PM2.5 in the Contiguous U.S. Using Parallel Programming and k-d Tree

    PubMed Central

    Li, Lixin; Losser, Travis; Yorke, Charles; Piltner, Reinhard

    2014-01-01

    Epidemiological studies have identified associations between mortality and changes in concentration of particulate matter. These studies have highlighted the public concerns about health effects of particulate air pollution. Modeling fine particulate matter PM2.5 exposure risk and monitoring day-to-day changes in PM2.5 concentration is a critical step for understanding the pollution problem and embarking on the necessary remedy. This research designs, implements and compares two inverse distance weighting (IDW)-based spatiotemporal interpolation methods, in order to assess the trend of daily PM2.5 concentration for the contiguous United States over the year of 2009, at both the census block group level and county level. Traditionally, when handling spatiotemporal interpolation, researchers tend to treat space and time separately and reduce the spatiotemporal interpolation problems to a sequence of snapshots of spatial interpolations. In this paper, PM2.5 data interpolation is conducted in the continuous space-time domain by integrating space and time simultaneously, using the so-called extension approach. Time values are calculated with the help of a factor under the assumption that spatial and temporal dimensions are equally important when interpolating a continuous changing phenomenon in the space-time domain. Various IDW-based spatiotemporal interpolation methods with different parameter configurations are evaluated by cross-validation. In addition, this study explores computational issues (computer processing speed) faced during implementation of spatiotemporal interpolation for huge data sets. Parallel programming techniques and an advanced data structure, named k-d tree, are adapted in this paper to address the computational challenges. Significant computational improvement has been achieved. Finally, a web-based spatiotemporal IDW-based interpolation application is designed and implemented where users can visualize and animate spatiotemporal interpolation
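
    The extension approach lends itself to a compact sketch. The C fragment below is an illustration under assumed toy data, not the paper's code: time is treated as an extra coordinate scaled by a factor c, and an IDW estimate is computed over space-time distances. The paper's k-d tree neighbor search and parallel loops over query points are omitted for brevity.

      /* Space-time IDW under the extension approach (c scales time). */
      #include <math.h>
      #include <stdio.h>

      typedef struct { double x, y, t, value; } Sample;

      /* Squared space-time distance with time scaled by factor c. */
      static double st_dist2(double x, double y, double t,
                             const Sample *s, double c) {
          double dx = x - s->x, dy = y - s->y, dt = c * (t - s->t);
          return dx * dx + dy * dy + dt * dt;
      }

      static double idw(double x, double y, double t,
                        const Sample *s, int n, double c, double p) {
          double num = 0.0, den = 0.0;
          for (int i = 0; i < n; i++) {
              double d2 = st_dist2(x, y, t, &s[i], c);
              if (d2 == 0.0) return s[i].value;     /* exact hit */
              double w = 1.0 / pow(d2, p / 2.0);    /* weight = 1/d^p */
              num += w * s[i].value;
              den += w;
          }
          return num / den;
      }

      int main(void) {
          Sample obs[] = {                          /* (x, y, day, PM2.5) toy data */
              {0.0, 0.0, 1.0, 10.0}, {1.0, 0.0, 1.0, 14.0},
              {0.0, 1.0, 2.0, 12.0}, {1.0, 1.0, 2.0, 16.0},
          };
          double v = idw(0.5, 0.5, 1.5, obs, 4, /*c=*/1.0, /*p=*/2.0);
          printf("interpolated PM2.5 = %.2f\n", v); /* 13.00 by symmetry */
          return 0;
      }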

  3. Parallel technology for numerical modeling of fluid dynamics problems by high-accuracy algorithms

    NASA Astrophysics Data System (ADS)

    Gorobets, A. V.

    2015-04-01

    A parallel computation technology for modeling fluid dynamics problems by finite-volume and finite-difference methods of high accuracy is presented. The development of an algorithm, the design of a software implementation, and the creation of parallel programs for computations on large-scale computing systems are considered. The presented parallel technology is based on a multilevel parallel model combining various types of parallelism: with shared and distributed memory and with multiple and single instruction streams to multiple data flows.

  4. Does one size really fit all? The effectiveness of a non-diagnosis-specific integrated mental health care program in Germany in a prospective, parallel-group controlled multi-centre trial.

    PubMed

    Mueller-Stierlin, Annabel Sandra; Helmbrecht, Marina Julia; Herder, Katrin; Prinz, Stefanie; Rosenfeld, Nadine; Walendzik, Julia; Holzmann, Marco; Dinc, Uemmueguelsuem; Schützwohl, Matthias; Becker, Thomas; Kilian, Reinhold

    2017-08-01

    The Network for Mental Health (NWpG-IC) is an integrated mental health care program implemented in 2009 by cooperation between health insurance companies and community mental health providers in Germany. About 10,000 patients have since been enrolled. This is the first study evaluating the effectiveness of the program in comparison to standard mental health care in Germany. In a parallel-group controlled trial over 18 months conducted in five regions across Germany, a total of 260 patients enrolled in NWpG-IC and 251 patients in standard mental health care (TAU) were recruited between August 2013 and November 2014. The NWpG-IC patients had access to special services such as community-based multi-professional teams, case management, crisis intervention and family-oriented psychoeducation in addition to standard mental health care. The primary outcome empowerment (EPAS) and the secondary outcomes quality of life (WHO-QoL-BREF), satisfaction with psychiatric treatment (CSQ-8), psychosocial and clinical impairment (HoNOS) and information about mental health service needs (CAN) were measured four times at 6-month intervals. Linear mixed-effect regression models were used to estimate the main effects and interaction effects of treatment, time and primary diagnosis. Due to the non-randomised group assignment, propensity score adjustment was used to control the selection bias. NWpG-IC and TAU groups did not differ with respect to most primary and secondary outcomes in our participating patients who showed a broad spectrum of psychiatric diagnoses and illness severities. However, a significant improvement in terms of patients' satisfaction with psychiatric care and their perception of treatment participation in favour of the NWpG-IC group was found. Providing integrated mental health care for unspecific mentally ill target groups increases treatment participation and service satisfaction but seems not suitable to enhance the overall outcomes of mental health care in Germany.

  5. “Everybody Brush!”: Protocol for a Parallel-Group Randomized Controlled Trial of a Family-Focused Primary Prevention Program With Distribution of Oral Hygiene Products and Education to Increase Frequency of Toothbrushing

    PubMed Central

    2015-01-01

    Background Twice daily toothbrushing with fluoridated toothpaste is the most widely advocated preventive strategy for dental caries (tooth decay) and is recommended by professional dental associations. Not all parents, children, or adolescents follow this recommendation. This protocol describes the methods for the implementation and evaluation of a quality improvement health promotion program. Objective The objective of the study is to show a theory-informed, evidence-based program to improve twice daily toothbrushing and oral health-related quality of life that may reduce dental caries, dental treatment need, and costs. Methods The design is a parallel-group, pragmatic randomized controlled trial. Families of Medicaid-insured children and adolescents within a large dental care organization in central Oregon will participate in the trial (n=21,743). Families will be assigned to one of three groups: a test intervention, an active control, or a passive control condition. The intervention aims to address barriers and support for twice-daily toothbrushing. Families in the test condition will receive toothpaste and toothbrushes by mail for all family members every three months. In addition, they will receive education and social support to encourage toothbrushing via postcards, recorded telephone messages, and an optional participant-initiated telephone helpline. Families in the active control condition will receive the kit of supplies by mail, but no additional instructional information or telephone support. Families assigned to the passive control will be on a waiting list. The primary outcomes are restorative dental care received and, only for children younger than 36 months old at baseline, the frequency of twice-daily toothbrushing. Data will be collected through dental claims records and, for children younger than 36 months old at baseline, parent interviews and clinical exams. Results Enrollment of participants and baseline interviews have been completed.

  6. SKIRT: Hybrid parallelization of radiative transfer simulations

    NASA Astrophysics Data System (ADS)

    Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.

    2017-07-01

    We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
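
    A minimal hybrid sketch in C (illustrative only, not SKIRT's code) shows the pattern: MPI ranks divide the photon packets, OpenMP threads within each rank share a tally array, deposits use atomic operations rather than high-level locks, and a reduction combines the distributed results. Cell and packet counts are assumptions.

      /* Hybrid MPI + OpenMP tallying with lock-free atomic updates. */
      #include <mpi.h>
      #include <omp.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define NCELLS  64
      #define PACKETS 1000000L

      int main(int argc, char **argv) {
          int provided, rank, size;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          double local[NCELLS] = {0}, global[NCELLS];

          /* Threads within this process share 'local'; deposits use an
           * atomic update instead of a lock. */
          #pragma omp parallel
          {
              unsigned seed = 1234u + 97u * (unsigned)rank
                            + (unsigned)omp_get_thread_num();
              #pragma omp for
              for (long i = 0; i < PACKETS / size; i++) {
                  int cell = (int)(rand_r(&seed) % NCELLS); /* stand-in for ray tracing */
                  #pragma omp atomic
                  local[cell] += 1.0;                       /* deposit packet energy */
              }
          }

          /* Distributed-memory step: combine per-process tallies via MPI. */
          MPI_Reduce(local, global, NCELLS, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0) {
              double total = 0.0;
              for (int c = 0; c < NCELLS; c++) total += global[c];
              printf("deposited %.0f packets across %d ranks\n", total, size);
          }
          MPI_Finalize();
          return 0;
      }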

  7. Extending HPF for advanced data parallel applications

    NASA Technical Reports Server (NTRS)

    Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

    1994-01-01

    The stated goal of High Performance Fortran (HPF) was to 'address the problems of writing data parallel programs where the distribution of data affects performance'. After examining the current version of the language we are led to the conclusion that HPF has not fully achieved this goal. While the basic distribution functions offered by the language - regular block, cyclic, and block cyclic distributions - can support regular numerical algorithms, advanced applications such as particle-in-cell codes or unstructured mesh solvers cannot be expressed adequately. We believe that this is a major weakness of HPF, significantly reducing its chances of becoming accepted in the numeric community. The paper discusses the data distribution and alignment issues in detail, points out some flaws in the basic language, and outlines possible future paths of development. Furthermore, we briefly deal with the issue of task parallelism and its integration with the data parallel paradigm of HPF.

  8. The EMCC / DARPA Massively Parallel Electromagnetic Scattering Project

    NASA Technical Reports Server (NTRS)

    Woo, Alex C.; Hill, Kueichien C.

    1996-01-01

    The Electromagnetic Code Consortium (EMCC) was sponsored by the Advanced Research Projects Agency (ARPA) to demonstrate the effectiveness of massively parallel computing in large scale radar signature predictions. The EMCC/ARPA project consisted of three parts.

  9. QCD on the Massively Parallel Computer AP1000

    NASA Astrophysics Data System (ADS)

    Akemi, K.; Fujisaki, M.; Okuda, M.; Tago, Y.; Hashimoto, T.; Hioki, S.; Miyamura, O.; Takaishi, T.; Nakamura, A.; de Forcrand, Ph.; Hege, C.; Stamatescu, I. O.

    We present the QCD-TARO program of calculations which uses the parallel computer AP1000 of Fujitsu. We discuss the results on scaling, correlation times and hadronic spectrum, some aspects of the implementation and the future prospects.

  10. Trajectory optimization using parallel shooting method on parallel computer

    SciTech Connect

    Wirthman, D.J.; Park, S.Y.; Vadali, S.R.

    1995-03-01

    The efficiency of a parallel shooting method on a parallel computer for solving a variety of optimal control guidance problems is studied. Several examples are considered to demonstrate that a speedup of nearly 7 to 1 is achieved with the use of 16 processors. It is suggested that further improvements in performance can be achieved by parallelizing in the state domain. 10 refs.

  11. The implementation of bit-parallelism for DNA sequence alignment

    NASA Astrophysics Data System (ADS)

    Setyorini; Kuspriyanto; Widyantoro, D. H.; Pancoro, A.

    2017-05-01

    Dynamic programming (DP) remains the central algorithm of biological sequence alignment, and the matching-score computation is its most time-consuming step. Bit-parallelism is an approximate string matching technique that transforms the processing of individual DP matrix cells into the processing of machine words (groups of cells), computing the scores column-wise. By adopting the word-level operations of the underlying hardware, this technique promises to reduce the time spent in the score computation phase of the DP matrix. In this paper, we implement a bit-parallelism technique for DNA sequence alignment. Our implementation reduces the time for the score computation, but the alignment reconstruction step still needs improvement.
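
    One well-known bit-parallel formulation is Myers' bit-vector algorithm, which encodes a whole DP column in two machine words of vertical deltas and updates it with a handful of logical operations per text character. The sketch below computes the edit distance of a pattern of 1 to 64 symbols against a text; it illustrates the word-level technique in general, not this paper's specific implementation.

      /* Myers' bit-vector edit distance (pattern length m, 1 <= m <= 64). */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      static int bitparallel_edit_distance(const char *p, const char *t) {
          int m = (int)strlen(p);
          uint64_t Peq[256] = {0};
          for (int i = 0; i < m; i++)            /* match masks per symbol */
              Peq[(unsigned char)p[i]] |= 1ULL << i;

          uint64_t Pv = ~0ULL, Mv = 0;           /* +1 / -1 vertical deltas */
          uint64_t high = 1ULL << (m - 1);
          int score = m;

          for (const char *c = t; *c; c++) {     /* one column per text char */
              uint64_t Eq = Peq[(unsigned char)*c];
              uint64_t Xv = Eq | Mv;
              uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
              uint64_t Ph = Mv | ~(Xh | Pv);
              uint64_t Mh = Pv & Xh;
              if (Ph & high) score++;            /* bottom-row delta */
              if (Mh & high) score--;
              Ph = (Ph << 1) | 1;                /* +1 enters from the top row */
              Pv = (Mh << 1) | ~(Xv | Ph);
              Mv = Ph & Xv;
          }
          return score;
      }

      int main(void) {
          /* two substitutions apart, so this prints 2 */
          printf("%d\n", bitparallel_edit_distance("GATTACA", "GACTATA"));
          return 0;
      }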

  12. Parallel Computing for Probabilistic Response Analysis of High Temperature Composites

    NASA Technical Reports Server (NTRS)

    Sues, R. H.; Lua, Y. J.; Smith, M. D.

    1994-01-01

    The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit these parallelisms. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.

  13. A massively parallel memory-based story system for psychotherapy.

    PubMed

    Smith, R N; Chen, C C; Feng, F F; Gomez-Gauchia, H

    1993-10-01

    We describe a memory-based system for psychotherapy, Dr. Bob, built to run on the data parallel processor Thinking Machines, Inc., CM-2a Connection Machine. The system retrieves, in parallel, stories of alcohol addiction and sexual abuse which can be used by psychiatrists in working with their patients as part of their work in recovering from addictive behavior and psychological trauma. The program is written in *LISP (pronounced Star LISP), a version of LISP used in programming Connection Machines.

  14. The parallel I/O architecture of the high performance storage system (HPSS). Revision 1

    SciTech Connect

    Watson, R.W.; Coyne, R.A.

    1995-04-01

    Datasets of up to terabyte size and archives of petabyte capacity have created a serious imbalance between I/O and storage system performance and system functionality. One promising approach is the use of parallel data transfer techniques for client access to storage, peripheral-to-peripheral transfers, and remote file transfers. This paper describes the parallel I/O architecture and mechanisms, Parallel Transport Protocol (PTP), parallel FTP, and parallel client Application Programming Interface (API) used by the High Performance Storage System (HPSS). Parallel storage integration issues with a local parallel file system are also discussed.

  15. Resistor Combinations for Parallel Circuits.

    ERIC Educational Resources Information Center

    McTernan, James P.

    1978-01-01

    To help simplify both teaching and learning of parallel circuits, a high school electricity/electronics teacher presents and illustrates the use of tables of values for parallel resistive circuits in which total resistances are whole numbers. (MF)
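
    The idea behind such tables is easy to reproduce: two resistors in parallel combine as R = R1*R2/(R1+R2), so a short program can enumerate the whole-number cases, for example 3 || 6 = 2 ohms and 20 || 30 = 12 ohms. The 100-ohm bound below is an arbitrary choice for illustration.

      /* Enumerate resistor pairs whose parallel combination is whole. */
      #include <stdio.h>

      int main(void) {
          for (int r1 = 1; r1 <= 100; r1++)
              for (int r2 = r1; r2 <= 100; r2++)
                  if ((r1 * r2) % (r1 + r2) == 0)      /* R is an integer */
                      printf("%3d || %3d = %3d ohms\n",
                             r1, r2, r1 * r2 / (r1 + r2));
          return 0;
      }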

  16. Status of TRANSP Parallel Services

    NASA Astrophysics Data System (ADS)

    Indireshkumar, K.; Andre, Robert; McCune, Douglas; Randerson, Lewis

    2006-10-01

    The PPPL TRANSP code suite has been used successfully over many years to carry out time dependent simulations of tokamak plasmas. However, accurately modeling certain phenomena such as RF heating and fast ion behavior using TRANSP requires extensive computational power and will benefit from parallelization. Parallelizing all of TRANSP is not required: some parts will run sequentially while others run in parallel. To efficiently use a site's parallel services, the parallelized TRANSP modules are deployed to a shared ``parallel service'' on a separate cluster. The PPPL Monte Carlo fast ion module NUBEAM and the MIT RF module TORIC are the first TRANSP modules to be so deployed. This poster will show the performance scaling of these modules within the parallel server. Communications between the serial client and the parallel server will be described in detail, and measurements of startup and communications overhead will be shown. Physics modeling benefits for TRANSP users will be assessed.

  17. Asynchronous interpretation of parallel microprograms

    SciTech Connect

    Bandman, O.L.

    1984-03-01

    In this article, the authors demonstrate how to pass from a given synchronous interpretation of a parallel microprogram to an equivalent asynchronous interpretation, and investigate the cost associated with the rejection of external synchronization in parallel microprogram structures.

  18. The Galley Parallel File System

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David

    1996-01-01

    As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. The interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. We discuss Galley's file structure and application interface, as well as an application that has been implemented using that interface.

  19. Linear Bregman algorithm implemented in parallel GPU

    NASA Astrophysics Data System (ADS)

    Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping

    2015-08-01

    At present, most compressed sensing (CS) reconstruction algorithms converge slowly and are therefore difficult to run on a PC. To deal with this issue, we implement a widely used CS algorithm, the linearized Bregman algorithm, on a parallel GPU. The linearized Bregman algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, it involves only vector and matrix multiplications and a thresholding operation, and is therefore simpler and more efficient to program. We use C as the development language and adopt CUDA (Compute Unified Device Architecture) as the parallel computing architecture. In this paper, we compare the parallel Bregman algorithm with a traditional CPU implementation, and also with other CS reconstruction algorithms such as OMP and TwIST. The results show that the parallel Bregman algorithm needs less time, and thus is more convenient for real-time object reconstruction, which matters given the fast-growing demands of information technology.
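
    For reference, one common form of the linearized Bregman iteration (u = delta * shrink(v, mu); v += A^T(b - A u)) is sketched below in plain C on a toy problem. The matrix, mu, delta, and iteration count are assumptions for illustration; a GPU version would map the vector and matrix products to CUDA kernels.

      /* Linearized Bregman iteration on a small toy system A u = b. */
      #include <math.h>
      #include <stdio.h>

      #define ROWS 2
      #define COLS 4

      static double shrink(double x, double mu) {        /* soft threshold */
          return (x > mu) ? x - mu : (x < -mu) ? x + mu : 0.0;
      }

      int main(void) {
          /* rows are orthonormal, so A A^T = I and delta = 1 is safe */
          const double r = 1.0 / sqrt(2.0);
          double A[ROWS][COLS] = {{r, 0, r, 0}, {0, r, 0, r}};
          double b[ROWS] = {1.0, 0.0};
          double u[COLS] = {0}, v[COLS] = {0};
          const double mu = 0.1, delta = 1.0;

          for (int it = 0; it < 2000; it++) {
              for (int j = 0; j < COLS; j++) u[j] = delta * shrink(v[j], mu);
              double res[ROWS];
              for (int i = 0; i < ROWS; i++) {           /* res = b - A u */
                  res[i] = b[i];
                  for (int j = 0; j < COLS; j++) res[i] -= A[i][j] * u[j];
              }
              for (int j = 0; j < COLS; j++)             /* v += A^T res */
                  for (int i = 0; i < ROWS; i++) v[j] += A[i][j] * res[i];
          }
          /* prints approximately [0.707 0.000 0.707 0.000] */
          printf("u = [%.3f %.3f %.3f %.3f]\n", u[0], u[1], u[2], u[3]);
          return 0;
      }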

  1. Parallel software support for computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.

    1987-01-01

    The application of the parallel programming methodology known as the Force was conducted. Two application issues were addressed. The first involves the efficiency of the implementation and its completeness in terms of satisfying the needs of other researchers implementing parallel algorithms. Support for, and interaction with, other Computational Structural Mechanics (CSM) researchers using the Force was the main issue, but some independent investigation of the Barrier construct, which is extremely important to overall performance, was also undertaken. Another efficiency issue which was addressed was that of relaxing the strong synchronization condition imposed on the self-scheduled parallel DO loop. The Force was extended by the addition of logical conditions to the cases of a parallel case construct and by the inclusion of a self-scheduled version of this construct. The second issue involved applying the Force to the parallelization of finite element codes such as those found in the NICE/SPAR testbed system. One of the more difficult problems encountered is the determination of what information in COMMON blocks is actually used outside of a subroutine and when a subroutine uses a COMMON block merely as scratch storage for internal temporary results.

  2. Equalizer: a scalable parallel rendering framework.

    PubMed

    Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato

    2009-01-01

    Continuing improvements in CPU and GPU performances as well as increasing multi-core processor and cluster-based parallelism demand flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and often only application specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic to support various types of data and visualization applications, and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems ranging from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture, the basic API, discuss its advantages over previous approaches, present example configurations and usage scenarios as well as scalability results.

  3. The Structure of Parallel Algorithms.

    DTIC Science & Technology

    1979-08-01

    Only bibliography fragments of this report were indexed, citing surveys of parallel architectures and parallel algorithms (Anderson and Jensen [1975], Stone [1975], Kung [1976], Enslow [1977], Kuck [1977], Ramamoorthy and Li [1977], Sameh [1977], Heller [1978]), including Sameh, A.H., "Numerical Parallel Algorithms -- A Survey," in High Speed Computer and Algorithm Organization (D. Kuck, D. Lawrie, and A. Sameh, eds.).

  4. Parallel unification scheduling in Prolog. Technical report

    SciTech Connect

    Citrin

    1986-09-18

    Unification, the fundamental operation in the Prolog logic programming language, can take up to 50% of the execution time of a typical Prolog system. One approach to speeding up the unification operation is to perform it on parallel hardware. Although it has been shown that, in general, there is no parallel algorithm for unification that is better than the best sequential algorithm, there is a substantial subset of unification which may be done in parallel. Identifying these subsets involves gathering data using an extension of Chang's static data-dependency analysis (SDDA), then using that data to schedule the components of a unification for parallel execution. Improvements to the information gathered by SDDA may be achieved through procedure splitting, a source-level transformation of the program. This thesis describes and evaluates the above-mentioned techniques and their implementation. Results are compared to other techniques for speeding up unification. Ways in which these techniques may be applied to the Berkeley PLM machine are also described.

  5. Parallel and Portable Monte Carlo Particle Transport

    NASA Astrophysics Data System (ADS)

    Lee, S. R.; Cummings, J. C.; Nolen, S. D.; Keen, N. D.

    1997-08-01

    We have developed a multi-group, Monte Carlo neutron transport code in C++ using object-oriented methods and the Parallel Object-Oriented Methods and Applications (POOMA) class library. This transport code, called MC++, currently computes k and α eigenvalues of the neutron transport equation on a rectilinear computational mesh. It is portable to and runs in parallel on a wide variety of platforms, including MPPs, clustered SMPs, and individual workstations. It contains appropriate classes and abstractions for particle transport and, through the use of POOMA, for portable parallelism. Current capabilities are discussed, along with physics and performance results for several test problems on a variety of hardware, including all three Accelerated Strategic Computing Initiative (ASCI) platforms. Current parallel performance indicates the ability to compute α-eigenvalues in seconds or minutes rather than days or weeks. Current and future work on the implementation of a general transport physics framework (TPF) is also described. This TPF employs modern C++ programming techniques to provide simplified user interfaces, generic STL-style programming, and compile-time performance optimization. Physics capabilities of the TPF will be extended to include continuous energy treatments, implicit Monte Carlo algorithms, and a variety of convergence acceleration techniques such as importance combing.

  6. Massively Parallel Genetics.

    PubMed

    Shendure, Jay; Fields, Stanley

    2016-06-01

    Human genetics has historically depended on the identification of individuals whose natural genetic variation underlies an observable trait or disease risk. Here we argue that new technologies now augment this historical approach by allowing the use of massively parallel assays in model systems to measure the functional effects of genetic variation in many human genes. These studies will help establish the disease risk of both observed and potential genetic variants and to overcome the problem of "variants of uncertain significance." Copyright © 2016 by the Genetics Society of America.

  7. Highly parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.; Tichy, Walter F.

    1990-01-01

    Highly parallel computing architectures are the only means to achieve the computation rates demanded by advanced scientific problems. A decade of research has demonstrated the feasibility of such machines, and current research focuses on which architectures are best suited to particular classes of problems. The architectures designated as multiple instruction multiple datastream (MIMD) and single instruction multiple datastream (SIMD) have produced the best results to date; neither shows a decisive advantage for most near-homogeneous scientific problems. For scientific problems with many dissimilar parts, more speculative architectures such as neural networks or data flow may be needed.

  8. Parallel sphere rendering

    SciTech Connect

    Krogh, M.; Hansen, C.; Painter, J.; de Verdiere, G.C.

    1995-05-01

    Sphere rendering is an important method for visualizing molecular dynamics data. This paper presents a parallel divide-and-conquer algorithm that is almost 90 times faster than current graphics workstations. To render extremely large data sets and large images, the algorithm uses the MIMD features of the supercomputers to divide up the data, render independent partial images, and then finally composite the multiple partial images using an optimal method. The algorithm and performance results are presented for the CM-5 and the T3D.

  9. Roo: A parallel theorem prover

    SciTech Connect

    Lusk, E.L.; McCune, W.W.; Slaney, J.K.

    1991-11-01

    We describe a parallel theorem prover based on the Argonne theorem-proving system OTTER. The parallel system, called Roo, runs on shared-memory multiprocessors such as the Sequent Symmetry. We explain the parallel algorithm used and give performance results that demonstrate near-linear speedups on large problems.

  10. CSM parallel structural methods research

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.

    1989-01-01

    Parallel structural methods, research team activities, advanced architecture computers for parallel computational structural mechanics (CSM) research, the FLEX/32 multicomputer, a parallel structural analyses testbed, blade-stiffened aluminum panel with a circular cutout and the dynamic characteristics of a 60 meter, 54-bay, 3-longeron deployable truss beam are among the topics discussed.

  11. Parallel transformation of K-SVD solar image denoising algorithm

    NASA Astrophysics Data System (ADS)

    Liang, Youwen; Tian, Yu; Li, Mei

    2017-02-01

    The images obtained by observing the sun through a large telescope always suffer from noise due to the low SNR. The K-SVD denoising algorithm can effectively remove Gaussian white noise, but training dictionaries for sparse representations is a time-consuming task, owing to the large size of the data involved and the complexity of the training algorithms. In this paper, the OpenMP parallel programming language is used to transform the serial algorithm into a parallel version, following a data-parallelism model. The biggest change is that multiple atoms, rather than one atom at a time, are updated simultaneously. The denoising effect and acceleration performance were tested after completion of the parallel algorithm. The speedup of the program is 13.563 when using 16 cores. This parallel version fully utilizes multi-core CPU hardware resources, greatly reduces running time, and is easy to port to other multi-core platforms.
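
    The transformation pattern itself is simple to sketch in C with OpenMP: a loop over dictionary atoms whose iterations are independent can be updated simultaneously with a parallel for. The update body below is a stand-in (column normalization), not the full K-SVD SVD step, and the dictionary contents are synthetic.

      /* Data-parallel atom updates with OpenMP (illustrative stand-in). */
      #include <math.h>
      #include <omp.h>
      #include <stdio.h>

      #define NATOMS 256
      #define DIM    64

      static double D[DIM][NATOMS];        /* dictionary, one atom per column */

      int main(void) {
          for (int i = 0; i < DIM; i++)    /* synthetic dictionary entries */
              for (int k = 0; k < NATOMS; k++)
                  D[i][k] = (double)((i * 31 + k * 17) % 13) - 6.0;

          double t0 = omp_get_wtime();
          /* Multiple atoms updated simultaneously: one column per iteration. */
          #pragma omp parallel for schedule(dynamic)
          for (int k = 0; k < NATOMS; k++) {
              double norm = 0.0;
              for (int i = 0; i < DIM; i++) norm += D[i][k] * D[i][k];
              norm = sqrt(norm);
              if (norm > 0.0)
                  for (int i = 0; i < DIM; i++) D[i][k] /= norm;
          }
          printf("atom updates took %.6f s on up to %d threads\n",
                 omp_get_wtime() - t0, omp_get_max_threads());
          return 0;
      }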

  12. Review of parallel computing methods and tools for FPGA technology

    NASA Astrophysics Data System (ADS)

    Cieszewski, Radosław; Linczuk, Maciej; Pozniak, Krzysztof; Romaniuk, Ryszard

    2013-10-01

    Parallel computing is emerging as an important area of research in computer architectures and software systems. Many algorithms can be greatly accelerated using parallel computing techniques. Specialized parallel computer architectures are used for accelerating specific tasks. High-Energy Physics Experiments measuring systems often use FPGAs for fine-grained computation. FPGA combines many benefits of both software and ASIC implementations. Like software, the mapped circuit is flexible, and can be reconfigured over the lifetime of the system. FPGAs therefore have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute operations of traditional processors, and possibly exploiting a greater level of parallelism. Creating parallel programs implemented in FPGAs is not trivial. This paper presents existing methods and tools for fine-grained computation implemented in FPGA using Behavioral Description and High Level Programming Languages.

  13. Parallel image computation in clusters with task-distributor.

    PubMed

    Baun, Christian

    2016-01-01

    Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel for speeding up the image computation. Because ray tracing is a computationally expensive and memory-consuming task, ray tracing can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). Task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation, in contrast to various other projects. By analyzing the runtime of the sequential and parallel program parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application.
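
    Given such measurements of the sequential and parallel parts, Amdahl's law predicts how the achievable speedup saturates as workers are added. The helper below illustrates this with an assumed 5% sequential fraction; the numbers are illustrative, not measurements from the paper.

      /* Amdahl's law: speedup = 1 / (s + (1 - s)/n) for serial fraction s. */
      #include <stdio.h>

      static double amdahl_speedup(double serial_fraction, int workers) {
          return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers);
      }

      int main(void) {
          double s = 0.05;                 /* assumed 5% sequential part */
          for (int n = 1; n <= 64; n *= 2)
              printf("%2d workers: speedup %.2f\n", n, amdahl_speedup(s, n));
          return 0;
      }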

  14. Parallel community climate model: Description and user's guide

    SciTech Connect

    Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H.

    1996-07-15

    This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user's guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
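
    The patch decomposition pattern can be sketched with MPI's Cartesian topology routines (an illustration, not PCCM2 source): ranks are factored into a 2-D grid of geographical patches, and each rank learns its patch coordinates and neighbors for boundary exchange. The periodicity choice below (periodic in longitude, not in latitude) is an assumption.

      /* 2-D patch decomposition with an MPI Cartesian topology. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int rank, size, dims[2] = {0, 0}, periods[2] = {1, 0}, coords[2];
          MPI_Comm grid;
          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          MPI_Dims_create(size, 2, dims);           /* factor ranks into a 2-D grid */
          MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
          MPI_Comm_rank(grid, &rank);               /* rank may be reordered */
          MPI_Cart_coords(grid, rank, 2, coords);

          int west, east, south, north;
          MPI_Cart_shift(grid, 0, 1, &west, &east); /* longitude neighbors */
          MPI_Cart_shift(grid, 1, 1, &south, &north);

          printf("rank %d owns patch (%d,%d) of %dx%d; E/W neighbors %d/%d\n",
                 rank, coords[0], coords[1], dims[0], dims[1], east, west);

          MPI_Comm_free(&grid);
          MPI_Finalize();
          return 0;
      }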

  15. Benchmarking massively parallel architectures

    SciTech Connect

    Lubeck, O.; Moore, J.; Simmons, M.; Wasserman, H.

    1993-01-01

    The purpose of this paper is to summarize some initial experiences related to measuring the performance of massively parallel processors (MPPs) at Los Alamos National Laboratory (LANL). Actually, the range of MPP architectures the authors have used is rather limited, being confined mostly to the Thinking Machines Corporation (TMC) Connection Machine CM-2 and CM-5. Some very preliminary work has been carried out on the Kendall Square KSR-1, and efforts related to other machines, such as the Intel Paragon and the soon-to-be-released CRAY T3D are planned. This paper will concentrate more on methodology rather than discuss specific architectural strengths and weaknesses; the latter is expected to be the subject of future reports. MPP benchmarking is a field in critical need of structure and definition. As the authors have stated previously, such machines have enormous potential, and there is certainly a dire need for orders of magnitude computational power over current supercomputers. However, performance reports for MPPs must emphasize actual sustainable performance from real applications in a careful, responsible manner. Such has not always been the case. A recent paper has described in some detail, the problem of potentially misleading performance reporting in the parallel scientific computing field. Thus, in this paper, the authors briefly offer a few general ideas on MPP performance analysis.

  16. Parallelizing quantum circuit synthesis

    NASA Astrophysics Data System (ADS)

    Di Matteo, Olivia; Mosca, Michele

    2016-03-01

    Quantum circuit synthesis is the process in which an arbitrary unitary operation is decomposed into a sequence of gates from a universal set, typically one which a quantum computer can implement both efficiently and fault-tolerantly. As physical implementations of quantum computers improve, the need is growing for tools that can effectively synthesize components of the circuits and algorithms they will run. Existing algorithms for exact, multi-qubit circuit synthesis scale exponentially in the number of qubits and circuit depth, leaving synthesis intractable for circuits on more than a handful of qubits. Even modest improvements in circuit synthesis procedures may lead to significant advances, pushing forward the boundaries of not only the size of solvable circuit synthesis problems, but also in what can be realized physically as a result of having more efficient circuits. We present a method for quantum circuit synthesis using deterministic walks. Also termed pseudorandom walks, these are walks in which once a starting point is chosen, its path is completely determined. We apply our method to construct a parallel framework for circuit synthesis, and implement one such version performing optimal T-count synthesis over the Clifford+T gate set. We use our software to present examples where parallelization offers a significant speedup on the runtime, as well as directly confirm that the 4-qubit 1-bit full adder has optimal T-count 7 and T-depth 3.

  17. Parallel Eigenvalue extraction

    NASA Technical Reports Server (NTRS)

    Akl, Fred A.

    1989-01-01

    A new numerical algorithm for the solution of large-order eigenproblems typically encountered in linear elastic finite element systems is presented. The algorithm exploits a parallel processing architecture to achieve increased speed and efficiency of calculation. It is based on the frontal technique for the solution of linear simultaneous equations and the modified subspace eigenanalysis method for the solution of the eigenproblem. Assembly, elimination, and back-substitution of degrees of freedom are performed concurrently, using a number of fronts. All fronts converge to and diverge from a predefined global front during elimination and back-substitution, respectively. In the meantime, the reduction of the stiffness and mass matrices required by the modified subspace method can be completed during the convergence/divergence cycle and an estimate of the required eigenpairs obtained. Successive cycles of convergence and divergence are repeated until the desired accuracy of calculation is achieved. The advantages of this new algorithm on parallel computer architectures are discussed.
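
    For reference, the subspace method at the core of the algorithm can be sketched in serial form. This is a minimal NumPy/SciPy illustration of classical subspace iteration for the generalized eigenproblem K x = lambda M x; the multi-front parallel organization described above is not reproduced here:

      # Serial sketch of subspace iteration (illustrative; the paper's
      # contribution is organizing the reduction across concurrent fronts).
      import numpy as np
      from scipy.linalg import eigh, solve

      def subspace_iteration(K, M, p, iters=50):
          n = K.shape[0]
          X = np.random.rand(n, p)                 # starting subspace
          for _ in range(iters):
              Y = solve(K, M @ X)                  # inverse iteration step
              Kr, Mr = Y.T @ K @ Y, Y.T @ M @ Y    # project onto the subspace
              vals, Q = eigh(Kr, Mr)               # reduced eigenproblem
              X = Y @ Q                            # rotate basis toward eigenvectors
          return vals, X

      # Example: a small symmetric positive-definite stiffness matrix, unit mass.
      rng = np.random.default_rng(0)
      A = rng.random((50, 50))
      K = A @ A.T + 50 * np.eye(50)
      M = np.eye(50)
      vals, X = subspace_iteration(K, M, p=4)
      print(vals)  # estimates of the lowest eigenvalues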

  18. Parallel ptychographic reconstruction

    PubMed Central

    Nashed, Youssef S. G.; Vine, David J.; Peterka, Tom; Deng, Junjing; Ross, Rob; Jacobsen, Chris

    2014-01-01

    Ptychography is an imaging method whereby a coherent beam is scanned across an object, and an image is obtained by iterative phasing of the set of diffraction patterns. It can be used to image extended objects at a resolution limited by the scattering strength of the object and the detector geometry, rather than at an optics-imposed limit. As technical advances allow larger fields to be imaged, computational challenges arise for reconstructing the correspondingly larger data volumes; at the same time, there is also a need to deliver reconstructed images immediately, so that one can evaluate the next steps to take in an experiment. Here we present a parallel method for real-time ptychographic phase retrieval. It uses a hybrid parallel strategy to divide the computation between multiple graphics processing units (GPUs) and then employs novel techniques to merge sub-datasets into a single complex phase and amplitude image. Results are shown on a simulated specimen and a real dataset from an X-ray experiment conducted at a synchrotron light source. PMID:25607174
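
    The split-and-merge idea can be sketched at a high level: partition the scan positions among workers, reconstruct each sub-region independently, and blend overlapping regions by weighted averaging. A minimal illustration (hypothetical shapes; NumPy standing in for the GPU kernels and for the paper's merging techniques):

      # Sketch of split-and-merge: each worker phases its own subset of scan
      # positions, and overlapping sub-images are blended by weighted averaging.
      # Illustrative only; the real method must also resolve phase and position
      # ambiguities between sub-images.
      import numpy as np

      H, W, n_workers = 256, 256, 4
      accum = np.zeros((H, W), dtype=complex)   # merged complex image
      weight = np.zeros((H, W))                 # per-pixel contribution count

      for w in range(n_workers):
          # Each worker owns a horizontal strip, padded so strips overlap.
          lo = max(0, w * H // n_workers - 16)
          hi = min(H, (w + 1) * H // n_workers + 16)
          sub = np.ones((hi - lo, W), dtype=complex)  # stand-in sub-reconstruction
          accum[lo:hi] += sub
          weight[lo:hi] += 1.0

      merged = accum / np.maximum(weight, 1.0)  # weighted average in overlaps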

  19. Parallel ptychographic reconstruction

    SciTech Connect

    Nashed, Youssef S. G.; Vine, David J.; Peterka, Tom; Deng, Junjing; Ross, Rob; Jacobsen, Chris

    2014-12-19

    Ptychography is an imaging method whereby a coherent beam is scanned across an object, and an image is obtained by iterative phasing of the set of diffraction patterns. It can be used to image extended objects at a resolution limited by the scattering strength of the object and the detector geometry, rather than at an optics-imposed limit. As technical advances allow larger fields to be imaged, computational challenges arise for reconstructing the correspondingly larger data volumes; at the same time, there is also a need to deliver reconstructed images immediately, so that one can evaluate the next steps to take in an experiment. Here we present a parallel method for real-time ptychographic phase retrieval. It uses a hybrid parallel strategy to divide the computation between multiple graphics processing units (GPUs) and then employs novel techniques to merge sub-datasets into a single complex phase and amplitude image. Results are shown on a simulated specimen and a real dataset from an X-ray experiment conducted at a synchrotron light source.

  20. Applied Parallel Metadata Indexing

    SciTech Connect

    Jacobi, Michael R

    2012-08-01

    The GPFS Archive is a parallel archive used by hundreds of users in the Turquoise collaboration network. It houses 4+ petabytes of data in more than 170 million files. Currently, users must navigate the file system to retrieve their data, requiring them to remember file paths and names. A better solution would allow users to tag data with meaningful labels and search the archive using standard and user-defined metadata, while maintaining security. Last summer, the author developed the backend of a tool that adheres to these design goals. The backend works by importing GPFS metadata into a MongoDB cluster, which is then indexed on each attribute. This summer, the author implemented security and developed the user interface for the search tool. To meet security requirements, each database table is associated with a single user, stores only records that the user may read, and requires a set of credentials to access. The interface to the search tool is implemented using FUSE (Filesystem in USErspace). FUSE is an intermediate layer that intercepts file system calls and allows the developer to redefine how those calls behave. In the case of this tool, FUSE interfaces with MongoDB to issue queries and populate output. A FUSE implementation is desirable because it allows users to interact with the search tool using commands they are already familiar with. These security and interface additions are essential for a usable product.
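
    The import-and-index backend described above might look like the following pymongo sketch (the collection name, field names, and sample record are hypothetical, not the tool's actual schema; a reachable MongoDB instance is assumed):

      # Sketch of the import-and-index backend: load per-file metadata into a
      # per-user collection, index each searchable attribute, then query by tag.
      from pymongo import MongoClient, ASCENDING

      client = MongoClient("localhost", 27017)
      db = client["archive_metadata"]
      coll = db["user_alice"]          # one collection per user, per the design

      # A toy record standing in for one file's GPFS metadata plus user tags.
      coll.insert_one({
          "path": "/archive/alice/run42/output.h5",
          "size": 1_048_576,
          "mtime": "2012-07-01T12:00:00Z",
          "tags": ["run42", "climate"],
      })

      # Index each searchable attribute so queries avoid full collection scans.
      for field in ("path", "size", "mtime", "tags"):
          coll.create_index([(field, ASCENDING)])

      # A FUSE layer would translate directory listings into queries like this:
      for doc in coll.find({"tags": "run42"}):
          print(doc["path"])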