Science.gov

Sample records for parallel programming ppopp

  1. Parallel programming with PCN

    SciTech Connect

    Foster, I.; Tuecke, S.

    1991-12-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. In includes both tutorial and reference material. It also presents the basic concepts that underly PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (c.f. Appendix A).

  2. Bilingual parallel programming

    SciTech Connect

    Foster, I.; Overbeek, R.

    1990-01-01

    Numerous experiments have demonstrated that computationally intensive algorithms support adequate parallelism to exploit the potential of large parallel machines. Yet successful parallel implementations of serious applications are rare. The limiting factor is clearly programming technology. None of the approaches to parallel programming that have been proposed to date -- whether parallelizing compilers, language extensions, or new concurrent languages -- seem to adequately address the central problems of portability, expressiveness, efficiency, and compatibility with existing software. In this paper, we advocate an alternative approach to parallel programming based on what we call bilingual programming. We present evidence that this approach provides and effective solution to parallel programming problems. The key idea in bilingual programming is to construct the upper levels of applications in a high-level language while coding selected low-level components in low-level languages. This approach permits the advantages of a high-level notation (expressiveness, elegance, conciseness) to be obtained without the cost in performance normally associated with high-level approaches. In addition, it provides a natural framework for reusing existing code.

  3. Parallel programming with PCN

    SciTech Connect

    Foster, I.; Tuecke, S.

    1993-01-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and Cthat allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous ftp from Argonne National Laboratory in the directory pub/pcn at info.mcs. ani.gov (cf. Appendix A). This version of this document describes PCN version 2.0, a major revision of the PCN programming system. It supersedes earlier versions of this report.

  4. Parallel programming with Ada

    SciTech Connect

    Kok, J.

    1988-01-01

    To the human programmer the ease of coding distributed computing is highly dependent on the suitability of the employed programming language. But with a particular language it is also important whether the possibilities of one or more parallel architectures can efficiently be addressed by available language constructs. In this paper the possibilities are discussed of the high-level language Ada and in particular of its tasking concept as a descriptional tool for the design and implementation of numerical and other algorithms that allow execution of parts in parallel. Language tools are explained and their use for common applications is shown. Conclusions are drawn about the usefulness of several Ada concepts.

  5. Tolerant (parallel) Programming

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Bailey, David H. (Technical Monitor)

    1997-01-01

    In order to be truly portable, a program must be tolerant of a wide range of development and execution environments, and a parallel program is just one which must be tolerant of a very wide range. This paper first defines the term "tolerant programming", then describes many layers of tools to accomplish it. The primary focus is on F-Nets, a formal model for expressing computation as a folded partial-ordering of operations, thereby providing an architecture-independent expression of tolerant parallel algorithms. For implementing F-Nets, Cooperative Data Sharing (CDS) is a subroutine package for implementing communication efficiently in a large number of environments (e.g. shared memory and message passing). Software Cabling (SC), a very-high-level graphical programming language for building large F-Nets, possesses many of the features normally expected from today's computer languages (e.g. data abstraction, array operations). Finally, L2(sup 3) is a CASE tool which facilitates the construction, compilation, execution, and debugging of SC programs.

  6. Parallel execution of LISP programs

    SciTech Connect

    Weening, J.S.

    1989-01-01

    This dissertation considers several issues in the execution of Lisp programs on shared-memory multiprocessors. An overview of constructs for explicit parallelism in Lisp is first presented. The problems of partitioning a program into processes and scheduling these processes are then described, and a number of methods for performing these are proposed. These include cutting off process creation based on properties of the computation tree of the program, and basing partitioning decisions on the state of the system at runtime instead of the program. An experimental study of these methods has been performed using a simulator for parallel Lisp. The simulator, written in common Lisp using a continuation-passing style, is described in detail. This is followed by a description of the experiments that were performed and an analysis of the results. Two programs are used as illustrations-a Fast Fourier Transform, which has an abundance of parallelism, and the Cocke-Younger-Kasami parsing algorithm, for which good speedup is not as easy to obtain. The difficulty of using cutoff-based partitioning methods, and the differences between various scheduling methods, are shown. A combination of partitioning and scheduling methods which the author calls dynamic partitioning is analyzed in more detail. This method is based on examining the machine's runtime state; it requires that the programmer only identify parallelism in the program, without deciding which potential parallelism is actually useful. Several theorems are proved providing upper bounds on the amount of overhead produced by this method. He concludes that for programs whose computation trees have small height relative to their total size, dynamic partitioning can achieve asymptotically minimal overhead in the cost of process creation.

  7. Information hiding in parallel programs

    SciTech Connect

    Foster, I.

    1992-01-30

    A fundamental principle in program design is to isolate difficult or changeable design decisions. Application of this principle to parallel programs requires identification of decisions that are difficult or subject to change, and the development of techniques for hiding these decisions. We experiment with three complex applications, and identify mapping, communication, and scheduling as areas in which decisions are particularly problematic. We develop computational abstractions that hide such decisions, and show that these abstractions can be used to develop elegant solutions to programming problems. In particular, they allow us to encode common structures, such as transforms, reductions, and meshes, as software cells and templates that can reused in different applications. An important characteristic of these structures is that they do not incorporate mapping, communication, or scheduling decisions: these aspects of the design are specified separately, when composing existing structures to form applications. This separation of concerns allows the same cells and templates to be reused in different contexts.

  8. Global Arrays Parallel Programming Toolkit

    SciTech Connect

    Nieplocha, Jaroslaw; Krishnan, Manoj Kumar; Palmer, Bruce J.; Tipparaju, Vinod; Harrison, Robert J.; Chavarría-Miranda, Daniel

    2011-01-01

    The two predominant classes of programming models for parallel computing are distributed memory and shared memory. Both shared memory and distributed memory models have advantages and shortcomings. Shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine grain load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message-passing. However, this performance comes at the cost of compromising the ease of use that the shared memory model advertises. Distributed memory models, such as message-passing or one-sided communication, offer performance and scalability but they are difficult to program. The Global Arrays toolkit attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed by the programmer. This management is achieved by calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be specified by the programmer and hence managed. GA is related to the global address space languages such as UPC, Titanium, and, to a lesser extent, Co-Array Fortran. In addition, by providing a set of data-parallel operations, GA is also related to data-parallel languages such as HPF, ZPL, and Data Parallel C. However, the Global Array programming model is implemented as a library that works with most languages used for technical computing and does not rely on compiler technology for achieving

  9. The ParaScope parallel programming environment

    NASA Technical Reports Server (NTRS)

    Cooper, Keith D.; Hall, Mary W.; Hood, Robert T.; Kennedy, Ken; Mckinley, Kathryn S.; Mellor-Crummey, John M.; Torczon, Linda; Warren, Scott K.

    1993-01-01

    The ParaScope parallel programming environment, developed to support scientific programming of shared-memory multiprocessors, includes a collection of tools that use global program analysis to help users develop and debug parallel programs. This paper focuses on ParaScope's compilation system, its parallel program editor, and its parallel debugging system. The compilation system extends the traditional single-procedure compiler by providing a mechanism for managing the compilation of complete programs. Thus, ParaScope can support both traditional single-procedure optimization and optimization across procedure boundaries. The ParaScope editor brings both compiler analysis and user expertise to bear on program parallelization. It assists the knowledgeable user by displaying and managing analysis and by providing a variety of interactive program transformations that are effective in exposing parallelism. The debugging system detects and reports timing-dependent errors, called data races, in execution of parallel programs. The system combines static analysis, program instrumentation, and run-time reporting to provide a mechanical system for isolating errors in parallel program executions. Finally, we describe a new project to extend ParaScope to support programming in FORTRAN D, a machine-independent parallel programming language intended for use with both distributed-memory and shared-memory parallel computers.

  10. Parallel Programming in the Age of Ubiquitous Parallelism

    NASA Astrophysics Data System (ADS)

    Pingali, Keshav

    2014-04-01

    Multicore and manycore processors are now ubiquitous, but parallel programming remains as difficult as it was 30-40 years ago. During this time, our community has explored many promising approaches including functional and dataflow languages, logic programming, and automatic parallelization using program analysis and restructuring, but none of these approaches has succeeded except in a few niche application areas. In this talk, I will argue that these problems arise largely from the computation-centric foundations and abstractions that we currently use to think about parallelism. In their place, I will propose a novel data-centric foundation for parallel programming called the operator formulation in which algorithms are described in terms of actions on data. The operator formulation shows that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous even in complex, irregular graph applications such as mesh generation/refinement/partitioning and SAT solvers. Regular algorithms emerge as a special case of irregular ones, and many application-specific optimization techniques can be generalized to a broader context. The operator formulation also leads to a structural analysis of algorithms called TAO-analysis that provides implementation guidelines for exploiting parallelism efficiently. Finally, I will describe a system called Galois based on these ideas for exploiting amorphous data-parallelism on multicores and GPUs

  11. Parallelized direct execution simulation of message-passing parallel programs

    NASA Technical Reports Server (NTRS)

    Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

    1994-01-01

    As massively parallel computers proliferate, there is growing interest in findings ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing computers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, Large Application Parallel Simulation Environment (LAPSE), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

  12. Genetic Parallel Programming: design and implementation.

    PubMed

    Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong

    2006-01-01

    This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.

  13. Parallel programming with PCN. Revision 1

    SciTech Connect

    Foster, I.; Tuecke, S.

    1991-12-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. In includes both tutorial and reference material. It also presents the basic concepts that underly PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (c.f. Appendix A).

  14. Towards Distributed Memory Parallel Program Analysis

    SciTech Connect

    Quinlan, D; Barany, G; Panas, T

    2008-06-17

    This paper presents a parallel attribute evaluation for distributed memory parallel computer architectures where previously only shared memory parallel support for this technique has been developed. Attribute evaluation is a part of how attribute grammars are used for program analysis within modern compilers. Within this work, we have extended ROSE, a open compiler infrastructure, with a distributed memory parallel attribute evaluation mechanism to support user defined global program analysis required for some forms of security analysis which can not be addressed by a file by file view of large scale applications. As a result, user defined security analyses may now run in parallel without the user having to specify the way data is communicated between processors. The automation of communication enables an extensible open-source parallel program analysis infrastructure.

  15. A survey of parallel programming tools

    NASA Technical Reports Server (NTRS)

    Cheng, Doreen Y.

    1991-01-01

    This survey examines 39 parallel programming tools. Focus is placed on those tool capabilites needed for parallel scientific programming rather than for general computer science. The tools are classified with current and future needs of Numerical Aerodynamic Simulator (NAS) in mind: existing and anticipated NAS supercomputers and workstations; operating systems; programming languages; and applications. They are divided into four categories: suggested acquisitions, tools already brought in; tools worth tracking; and tools eliminated from further consideration at this time.

  16. Parallel programming interface for distributed data

    NASA Astrophysics Data System (ADS)

    Wang, Manhui; May, Andrew J.; Knowles, Peter J.

    2009-12-01

    The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform. Program summaryProgram title: PPIDD Catalogue identifier: AEEF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEEF_1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 17 698 No. of bytes in distributed program, including test data, etc.: 166 173 Distribution format: tar.gz Programming language: Fortran, C Computer: Many parallel systems Operating system: Various Has the code been vectorised or parallelized?: Yes. 2-256 processors used RAM: 50 Mbytes Classification: 6.5 External routines: Global Arrays or MPI-2 Nature of problem: Many scientific applications require management and communication of data that is global, and the standard MPI-2 protocol provides only low-level methods for the required one-sided remote memory access. Solution method: The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform. Running time: Problem dependent. The test provided with

  17. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Hood, Robert; Jost, Gabriele

    2001-01-01

    This viewgraph presentation provides information on support sources available for the automatic parallelization of computer program. CAPTools, a support tool developed at the University of Greenwich, transforms, with user guidance, existing sequential Fortran code into parallel message passing code. Comparison routines are then run for debugging purposes, in essence, ensuring that the code transformation was accurate.

  18. Hybrid parallel programming with MPI and Unified Parallel C.

    SciTech Connect

    Dinan, J.; Balaji, P.; Lusk, E.; Sadayappan, P.; Thakur, R.; Mathematics and Computer Science; The Ohio State Univ.

    2010-01-01

    The Message Passing Interface (MPI) is one of the most widely used programming models for parallel computing. However, the amount of memory available to an MPI process is limited by the amount of local memory within a compute node. Partitioned Global Address Space (PGAS) models such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a shared global address space that spans the memories of multiple compute nodes. However, taking advantage of UPC can require a large recoding effort for existing parallel applications. In this paper, we explore a new hybrid parallel programming model that combines MPI and UPC. This model allows MPI programmers incremental access to a greater amount of memory, enabling memory-constrained MPI codes to process larger data sets. In addition, the hybrid model offers UPC programmers an opportunity to create static UPC groups that are connected over MPI. As we demonstrate, the use of such groups can significantly improve the scalability of locality-constrained UPC codes. This paper presents a detailed description of the hybrid model and demonstrates its effectiveness in two applications: a random access benchmark and the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance; using hybrid UPC groups that span two cluster nodes, RA performance increases by a factor of 1.33 and using groups that span four cluster nodes, Barnes-Hut experiences a twofold speedup at the expense of a 2% increase in code size.

  19. Integrated Task and Data Parallel Programming

    NASA Technical Reports Server (NTRS)

    Grimshaw, A. S.

    1998-01-01

    This research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities During the fall I collaborated

  20. The PISCES 2 parallel programming environment

    NASA Technical Reports Server (NTRS)

    Pratt, Terrence W.

    1987-01-01

    PISCES 2 is a programming environment for scientific and engineering computations on MIMD parallel computers. It is currently implemented on a flexible FLEX/32 at NASA Langley, a 20 processor machine with both shared and local memories. The environment provides an extended Fortran for applications programming, a configuration environment for setting up a run on the parallel machine, and a run-time environment for monitoring and controlling program execution. This paper describes the overall design of the system and its implementation on the FLEX/32. Emphasis is placed on several novel aspects of the design: the use of a carefully defined virtual machine, programmer control of the mapping of virtual machine to actual hardware, forces for medium-granularity parallelism, and windows for parallel distribution of data. Some preliminary measurements of storage use are included.

  1. Parallel execution of Lisp programs. Doctoral thesis

    SciTech Connect

    Weening, J.S.

    1989-06-01

    This dissertation considers several issues in the execution of Lisp programs on shared-memory multiprocessors. An overview of constructs for explicit parallelism in Lisp is first presented. The problem of partitioning a program into process and scheduling these processes are then described, and a number of methods for performing these are proposed. These include cutting off process creation based on properties of the computation tree of the program, and basing partitioning decisions on the state of the system at runtime instead of the program. An experimental study of these methods has been performed using a simulator for parallel Lisp. This is followed by a description of the experiments that were performed and an analysis of the results. Two programs are used as illustrations-a Fast Fourier Transform, which has an abundance of parallelism, and the Cocke-Younger-Kasami parsing algorithm, for which good speedup is not as easy to obtain. The difficulty of using cutoff-based partitioning methods, and the differences between varios scheduling methods, are shown. A combination of partitioning and scheduling methods which we call dynamic partitioning is analyzed in more detail. This method is based on examining the machine's runtime state; it requires that the programmer only identify parallelism in the program, without deciding which potential parallelism is actually useful. We conclude that for programs whose computation trees have small height relative to their total size, dynamic partitioning can achieve asymptotically minimal overhead in the cost of process creation.

  2. Parallel programming with PCN. Revision 2

    SciTech Connect

    Foster, I.; Tuecke, S.

    1993-01-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and Cthat allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous ftp from Argonne National Laboratory in the directory pub/pcn at info.mcs. ani.gov (cf. Appendix A). This version of this document describes PCN version 2.0, a major revision of the PCN programming system. It supersedes earlier versions of this report.

  3. Portable parallel programming in a Fortran environment

    SciTech Connect

    May, E.N.

    1989-01-01

    Experience using the Argonne-developed PARMACs macro package to implement a portable parallel programming environment is described. Fortran programs with intrinsic parallelism of coarse and medium granularity are easily converted to parallel programs which are portable among a number of commercially available parallel processors in the class of shared-memory bus-based and local-memory network based MIMD processors. The parallelism is implemented using standard UNIX (tm) tools and a small number of easily understood synchronization concepts (monitors and message-passing techniques) to construct and coordinate multiple cooperating processes on one or many processors. Benchmark results are presented for parallel computers such as the Alliant FX/8, the Encore MultiMax, the Sequent Balance, the Intel iPSC/2 Hypercube and a network of Sun 3 workstations. These parallel machines are typical MIMD types with from 8 to 30 processors, each rated at from 1 to 10 MIPS processing power. The demonstration code used for this work is a Monte Carlo simulation of the response to photons of a ''nearly realistic'' lead, iron and plastic electromagnetic and hadronic calorimeter, using the EGS4 code system. 6 refs., 2 figs., 2 tabs.

  4. Communication Graph Generator for Parallel Programs

    SciTech Connect

    2014-04-08

    Graphator is a collection of relatively simple sequential programs that generate communication graphs/matrices for commonly occurring patterns in parallel programs. Currently, there is support for five communication patterns: two-dimensional 4-point stencil, four-dimensional 8-point stencil, all-to-alls over sub-communicators, random near-neighbor communication, and near-neighbor communication.

  5. Optics Program Modified for Multithreaded Parallel Computing

    NASA Technical Reports Server (NTRS)

    Lou, John; Bedding, Dave; Basinger, Scott

    2006-01-01

    A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.

  6. Plasmonics and the parallel programming problem

    NASA Astrophysics Data System (ADS)

    Vishkin, Uzi; Smolyaninov, Igor; Davis, Chris

    2007-02-01

    While many parallel computers have been built, it has generally been too difficult to program them. Now, all computers are effectively becoming parallel machines. Biannual doubling in the number of cores on a single chip, or faster, over the coming decade is planned by most computer vendors. Thus, the parallel programming problem is becoming more critical. The only known solution to the parallel programming problem in the theory of computer science is through a parallel algorithmic theory called PRAM. Unfortunately, some of the PRAM theory assumptions regarding the bandwidth between processors and memories did not properly reflect a parallel computer that could be built in previous decades. Reaching memories, or other processors in a multi-processor organization, required off-chip connections through pins on the boundary of each electric chip. Using the number of transistors that is becoming available on chip, on-chip architectures that adequately support the PRAM are becoming possible. However, the bandwidth of off-chip connections remains insufficient and the latency remains too high. This creates a bottleneck at the boundary of the chip for a PRAM-On-Chip architecture. This also prevents scalability to larger "supercomputing" organizations spanning across many processing chips that can handle massive amounts of data. Instead of connections through pins and wires, power-efficient CMOS-compatible on-chip conversion to plasmonic nanowaveguides is introduced for improved latency and bandwidth. Proper incorporation of our ideas offer exciting avenues to resolving the parallel programming problem, and an alternative way for building faster, more useable and much more compact supercomputers.

  7. Simulating Billion-Task Parallel Programs

    SciTech Connect

    Perumalla, Kalyan S; Park, Alfred J

    2014-01-01

    In simulating large parallel systems, bottom-up approaches exercise detailed hardware models with effects from simplified software models or traces, whereas top-down approaches evaluate the timing and functionality of detailed software models over coarse hardware models. Here, we focus on the top-down approach and significantly advance the scale of the simulated parallel programs. Via the direct execution technique combined with parallel discrete event simulation, we stretch the limits of the top-down approach by simulating message passing interface (MPI) programs with millions of tasks. Using a timing-validated benchmark application, a proof-of-concept scaling level is achieved to over 0.22 billion virtual MPI processes on 216,000 cores of a Cray XT5 supercomputer, representing one of the largest direct execution simulations to date, combined with a multiplexing ratio of 1024 simulated tasks per real task.

  8. Parallel Volunteer Learning during Youth Programs

    ERIC Educational Resources Information Center

    Lesmeister, Marilyn K.; Green, Jeremy; Derby, Amy; Bothum, Candi

    2012-01-01

    Lack of time is a hindrance for volunteers to participate in educational opportunities, yet volunteer success in an organization is tied to the orientation and education they receive. Meeting diverse educational needs of volunteers can be a challenge for program managers. Scheduling a Volunteer Learning Track for chaperones that is parallel to a…

  9. Program For Parallel Discrete-Event Simulation

    NASA Technical Reports Server (NTRS)

    Beckman, Brian C.; Blume, Leo R.; Geiselman, John S.; Presley, Matthew T.; Wedel, John J., Jr.; Bellenot, Steven F.; Diloreto, Michael; Hontalas, Philip J.; Reiher, Peter L.; Weiland, Frederick P.

    1991-01-01

    User does not have to add any special logic to aid in synchronization. Time Warp Operating System (TWOS) computer program is special-purpose operating system designed to support parallel discrete-event simulation. Complete implementation of Time Warp mechanism. Supports only simulations and other computations designed for virtual time. Time Warp Simulator (TWSIM) subdirectory contains sequential simulation engine interface-compatible with TWOS. TWOS and TWSIM written in, and support simulations in, C programming language.

  10. Relative Debugging of Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)

    2002-01-01

    We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular, the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify, the program execution with out changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.

  11. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)

    2001-01-01

    We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.

  12. Concurrency-based approaches to parallel programming

    SciTech Connect

    Kale, L.V.; Chrisochoides, N.; Kohl, J.

    1995-07-17

    The inevitable transition to parallel programming can be facilitated by appropriate tools, including languages and libraries. After describing the needs of applications developers, this paper presents three specific approaches aimed at development of efficient and reusable parallel software for irregular and dynamic-structured problems. A salient feature of all three approaches in their exploitation of concurrency within a processor. Benefits of individual approaches such as these can be leveraged by an interoperability environment which permits modules written using different approaches to co-exist in single applications.

  13. Concurrency-based approaches to parallel programming

    NASA Technical Reports Server (NTRS)

    Kale, L.V.; Chrisochoides, N.; Kohl, J.; Yelick, K.

    1995-01-01

    The inevitable transition to parallel programming can be facilitated by appropriate tools, including languages and libraries. After describing the needs of applications developers, this paper presents three specific approaches aimed at development of efficient and reusable parallel software for irregular and dynamic-structured problems. A salient feature of all three approaches in their exploitation of concurrency within a processor. Benefits of individual approaches such as these can be leveraged by an interoperability environment which permits modules written using different approaches to co-exist in single applications.

  14. Array distribution in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

    1994-01-01

    We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.

  15. Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress was made in hardware and software technologies, performance of parallel programs with compiler directives has demonstrated large improvement. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. Due to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based, OpenMP, parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS parallel benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.

  16. Parallel Programming Strategies for Irregular Adaptive Applications

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2001-01-01

    Achieving scalable performance for dynamic irregular applications is eminently challenging. Traditional message-passing approaches have been making steady progress towards this goal; however, they suffer from complex implementation requirements. The use of a global address space greatly simplifies the programming task, but can degrade the performance for such computations. In this work, we examine two typical irregular adaptive applications, Dynamic Remeshing and N-Body, under competing programming methodologies and across various parallel architectures. The Dynamic Remeshing application simulates flow over an airfoil, and refines localized regions of the underlying unstructured mesh. The N-Body experiment models two neighboring Plummer galaxies that are about to undergo a merger. Both problems demonstrate dramatic changes in processor workloads and interprocessor communication with time; thus, dynamic load balancing is a required component.

  17. Flexible language constructs for large parallel programs

    NASA Technical Reports Server (NTRS)

    Rosing, Matthew; Schnabel, Robert

    1993-01-01

    The goal of the research described is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (MIMD) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. An overview of a new language that combines many of these programming models in a clean manner is given. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. An overview of the language and discussion of some of the critical implementation details is given.

  18. Flexible Language Constructs for Large Parallel Programs

    DOE PAGES

    Rosing, Matt; Schnabel, Robert

    1994-01-01

    The goal of the research described in this article is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (multiple instruction multiple data [MIMD]) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include single instruction multiple data (SIMD), single program multiple data (SPMD), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression ofmore » the variety of algorithms that occur in large scientific computations. In this article, we give an overview of a new language that combines many of these programming models in a clean manner. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. In this article, we give an overview of the language and discuss some of the critical implementation details.« less

  19. Filter Circuit Design by Parallel Genetic Programming

    NASA Astrophysics Data System (ADS)

    Yano, Yuichi; Kato, Toshiji; Inoue, Kaoru; Miki, Mitsunori

    Genetic Programming (GP) is an extension of Genetic Algorithm(GA) to handle more structural problems. In this paper, an approach to filter circuit design by GP is proposed. By designing a gene which includes not only the parameters of consisting elements, but also the structural information of the circuit, it becomes possible to apply the proposed approach to various types of filter circuits. GP depends much on trial and error due to its probabilitic nature. To decrease this uncertainty and ensure the progress of the evolution, Parallel GP with multiple populations with the island model is also proposed. An MPI-based cluster system is used for realization of this parallel computing where each island correspondsd to each node. A lowpass and an asymmetric bandpass filters are designed. One hundred times of trials for multiple populations with and without migrations are tested in the design of lowpass filter to confirm the validity of the proposed method. In the asymmetric bandpass filter design, the results are compared with those of the circuit designed by hand to confirm the effectiveness of the proposed method. The proposed approach is applicable to various types of filter circuits. It can contribute to an automated design procedure, where it would require a expirenced designer if done by hand. It is also possible to obtain a new circuit design which would not be possible if done by hand.

  20. Grundy: Parallel Processor Architecture Makes Programming Easy

    NASA Astrophysics Data System (ADS)

    Meier, Robert J.

    1985-12-01

    Grundy, an architecture for parallel processing, facilitates the use of high-level languages. In Grundy, several thousand simple processors are dispersed throughout the address space and the concept of machine state is replaced by an invokation frame, a data structure of local variables, program counter, and pointers to superprocesses (parents), subprocesses (children), and concurrent processes (siblings). Each instruction execution consists of five phases. An instruction is fetched, the instruction is decoded, the sources are fetched, the operation is performed, and the destination is written. This breakdown of operations is easily pipelinable. The instruction format of Grundy is completely orthogonal, so Grundy machine code consists of a set of register transfer control bits. The process state pointers are used to collect unused resources such as processors and memory. Joseph Mahon[1] found that as the degree of physical parallelism increases, throughput, including overhead, increases even if extra overhead is needed to split logical processes. As stack pointer, accumulators, and index registers facilitate using high-level languages on conventional computers, pointers to parents, children, and siblings simplify the use of a run-time operating system. The ability to ignore the physical structure of a large number of simple processors supports the use of structured programming. A very simple processor cell allows the replication of approximately 16 32-bit processors on a single Very Large Scale Integration chip. (2M lambda[2]) A bootstrapper and Input/Output channels can be hardwired (using ROM cells and pseudo-processor cells) into a 100 chip computer that is expected to have over 500 procesors, 500K memory, and a network supporting up to 64 concurrent messages between 1000 nodes. These sizes are merely typical and not limits.

  1. Parallel solution of sparse one-dimensional dynamic programming problems

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1989-01-01

    Parallel computation offers the potential for quickly solving large computational problems. However, it is often a non-trivial task to effectively use parallel computers. Solution methods must sometimes be reformulated to exploit parallelism; the reformulations are often more complex than their slower serial counterparts. We illustrate these points by studying the parallelization of sparse one-dimensional dynamic programming problems, those which do not obviously admit substantial parallelization. We propose a new method for parallelizing such problems, develop analytic models which help us to identify problems which parallelize well, and compare the performance of our algorithm with existing algorithms on a multiprocessor.

  2. Parallel phase model : a programming model for high-end parallel machines with manycores.

    SciTech Connect

    Wu, Junfeng; Wen, Zhaofang; Heroux, Michael Allen; Brightwell, Ronald Brian

    2009-04-01

    This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementation of parallel algorithms to exploit both the parallelism of the many cores and the parallelism at the cluster level. The programming abstraction will be suitable for expressing both fine-grained and coarse-grained parallelism. It includes a few high-level parallel programming language constructs that can be added as an extension to an existing (sequential or parallel) programming language such as C; and the implementation of PPM also includes a light-weight runtime library that runs on top of an existing network communication software layer (e.g. MPI). Design philosophy of PPM and details of the programming abstraction are also presented. Several unstructured applications that inherently require high-volume random fine-grained data accesses have been implemented in PPM with very promising results.

  3. An interactive parallel programming environment applied in atmospheric science

    SciTech Connect

    Laszewski, G. von

    1996-12-31

    This article introduces an interactive parallel programming environment (IPPE) that simplifies the generation and execution of parallel programs. One of the tasks of the environment is to generate message-passing parallel programs for homogeneous and heterogeneous computing platforms. The parallel programs are represented by using visual objects. This is accomplished with the help of a graphical programming editor that is implemented in Java and enables portability to a wide variety of computer platforms. In contrast to other graphical programming systems, reusable parts of the programs can be stored in a program library to support rapid prototyping. In addition, runtime performance data on different computing platforms is collected in a database. A selection process determines dynamically the software and the hardware platform to be used to solve the problem in minimal wall-clock time. The environment is currently being tested on a Grand Challenge problem, the NASA four-dimensional data assimilation system.

  4. An interactive parallel programming environment applied in atmospheric science

    NASA Technical Reports Server (NTRS)

    vonLaszewski, G.

    1996-01-01

    This article introduces an interactive parallel programming environment (IPPE) that simplifies the generation and execution of parallel programs. One of the tasks of the environment is to generate message-passing parallel programs for homogeneous and heterogeneous computing platforms. The parallel programs are represented by using visual objects. This is accomplished with the help of a graphical programming editor that is implemented in Java and enables portability to a wide variety of computer platforms. In contrast to other graphical programming systems, reusable parts of the programs can be stored in a program library to support rapid prototyping. In addition, runtime performance data on different computing platforms is collected in a database. A selection process determines dynamically the software and the hardware platform to be used to solve the problem in minimal wall-clock time. The environment is currently being tested on a Grand Challenge problem, the NASA four-dimensional data assimilation system.

  5. Programming parallel architectures - The BLAZE family of languages

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush

    1989-01-01

    This paper gives an overview of the various approaches to programming multiprocessor architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive, since they remove much of the burden of exploiting parallel architectures from the user. This paper also describes recent work in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described.

  6. The BLAZE language: A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, P.; Vanrosendale, J.

    1985-01-01

    A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with onceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described and shows how this language would be used in typical scientific programming.

  7. The BLAZE language - A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Van Rosendale, John

    1987-01-01

    A Pascal-like scientific programming language, BLAZE, is described. BLAZE contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of BLAZE are described and it is shown how this language would be used in typical scientific programming.

  8. A Parallel Program Analysis Framework for the ACTS Toolkit

    SciTech Connect

    Allen D. Malony

    2002-06-21

    OAK 270 - The final report summarizes the technical progress achieved during the project. A Parallel Program Analysis Framework for the acts toolkit, referred to as the TAU project. Described are the results in four work areas: (1) creation of a performance system for integrated instrumentation, measurement, analysis and visualization. (2) development of a performance measurement system for parallel profiling and tracing (3) development of an advanced program analysis system to enable creation of source-based performance and programing tools (4) development of parallel program interaction technology for accessing, performance information and application data during execution.

  9. Parallelization of the CI Program PEDICI

    NASA Astrophysics Data System (ADS)

    Thorsteinsson, Thorstein; Rettrup, Sten

    The general CI code PEDICI has been parallelized by decomposing the occurring summation over two-electron integrals. The parallelization was formulated in terms of a "master/slave'' model, and realized through use of the "PVM'' message passing facility. We have aimed at achieving a reasonably simple implementation for use on machines with intermediate numbers of processors. Exploratory test runs on an IBM SP supercomputer (consisting of RS/6000 model P2SC (120 MHz) nodes) show a very satisfactory performance increase with the number of processors used, as well as encouraging balancing of the workload. Our largest 32-processor test case gives a speed-up factor of 30.27.

  10. Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++

    NASA Technical Reports Server (NTRS)

    Bodin, Francois; Priol, Thierry; Mehrotra, Piyush; Gannon, Dennis

    1994-01-01

    Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++ which is based on a concurrent aggregate parallel model.

  11. Language constructs and runtime systems for compositional parallel programming

    SciTech Connect

    Foster, I.; Kesselman, C.

    1995-03-01

    In task-parallel programs, diverse activities can take place concurrently, and communication and synchronization patterns are complex and not easily predictable. Previous work has identified compositionality as an important design principle for task-parallel programs. In this paper, we discuss alternative approaches to the realization of this principle. We first provide a review and critical analysis of Strand, an early compositional programming language. We examine the strengths of the Strand approach and also its weaknesses, which we attribute primarily to the use of a specialized language. Then, we present an alternative programming language framework that overcomes these weaknesses. This framework uses simple extensions to existing sequential languages (C++ and Fortran) and a common runtime system to provide a basis for the construction of large, task-parallel programs. We also discuss the runtime system techniques required to support these languages on parallel and distributed computer systems.

  12. Message based event specification for debugging nondeterministic parallel programs

    SciTech Connect

    Damohdaran-Kamal, S.K.; Francioni, J.M.

    1995-02-01

    Portability and reliability of parallel programs can be severely impaired by their nondeterministic behavior. Therefore, an effective means to precisely and accurately specify unacceptable nondeterministic behavior is necessary for testing and debugging parallel programs. In this paper we describe a class of expressions, called Message Expressions that can be used to specify nondeterministic behavior of message passing parallel programs. Specification of program behavior with Message Expressions is easier than pattern based specification techniques in that the former does not require knowledge of run-time event order, whereas that later depends on the user`s knowledge of the run-time event order for correct specification. We also discuss our adaptation of Message Expressions for use in a dynamic distributed testing and debugging tool, called mdb, for programs written for PVM (Parallel Virtual Machine).

  13. Programming parallel architectures: The BLAZE family of languages

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush

    1988-01-01

    Programming multiprocessor architectures is a critical research issue. An overview is given of the various approaches to programming these architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive since they remove much of the burden of exploiting parallel architectures from the user. Also described is recent work by the author in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described, as well as the relations of this work to other current language research projects.

  14. Parallel Molecular Dynamics Program for Molecules

    SciTech Connect

    Plimpton, Steve

    1995-03-07

    ParBond is a parallel classical molecular dynamics code that models bonded molecular systems, typically of an organic nature. It uses classical force fields for both non-bonded Coulombic and Van der Waals interactions and for 2-, 3-, and 4-body bonded (bond, angle, dihedral, and improper) interactions. It integrates Newton''s equation of motion for the molecular system and evaluates various thermodynamical properties of the system as it progresses.

  15. Programming Probabilistic Structural Analysis for Parallel Processing Computer

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.

    1991-01-01

    The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.

  16. The FORCE - A highly portable parallel programming language

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.

  17. The FORCE: A highly portable parallel programming language

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory microprocessors, and how a two-level macro preprocessor makes it possible to hide low level machine dependencies and to build machine-independent high level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared memory multiprocessor executing them.

  18. Integrated Task And Data Parallel Programming: Language Design

    NASA Technical Reports Server (NTRS)

    Grimshaw, Andrew S.; West, Emily A.

    1998-01-01

    his research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program m. Additional 1995 Activities During the fall I collaborated

  19. Characterizing and Mitigating Work Time Inflation in Task Parallel Programs

    DOE PAGES

    Olivier, Stephen L.; de Supinski, Bronis R.; Schulz, Martin; Prins, Jan F.

    2013-01-01

    Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMAmore » systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.« less

  20. The interprocedural analysis and automatic parallelization of scheme programs

    SciTech Connect

    Harrison, W.L. III.

    1989-01-01

    Lisp and its descendants are among the most important and widely used of programming languages. At the same time, parallelism in the architecture of computer systems is becoming commonplace. There is a pressing need to extend the technology of automatic parallelization that has become available to Fortran programmers of parallel machines, to the realm of Lisp programs and symbolic computing. In this thesis the authors presents a comprehensive approach to the compilation of Scheme programs for share-memory multiprocessors. The strategy has two principal components: interprocedural analysis and program restructuring. He introduces procedure strings and stack configurations as a framework in which to reason about interprocedural side-effects and object lifetimes, and develop a system of interprocedural analysis, using abstract interpretation, that is used in the dependence analysis and memory management of Scheme programs. He introduces the transformations of exit-loop translation and recursion splitting to treat the control structures of iteration and recursion that arise commonly in Scheme programs. He proposes an alternative representation for s-expressions that facilitates the parallel creation and access of lists. He has implemented these ideas in a parallelizing scheme compiler and run-time system, and he complements the theory of the work with snapshots of programs during the restructuring process, and some preliminary performance results of the execution of object codes produced by the compiler.

  1. NavP: Structured and Multithreaded Distributed Parallel Programming

    NASA Technical Reports Server (NTRS)

    Pan, Lei; Xu, Jingling

    2006-01-01

    This slide presentation reviews some of the issues around distributed parallel programming. It compares and contrast two methods of programming: Single Program Multiple Data (SPMD) with the Navigational Programming (NAVP). It then reviews the distributed sequential computing (DSC) method and the methodology of NavP. Case studies are presented. It also reviews the work that is being done to enable the NavP system.

  2. Development of massively parallel quantum chemistry program SMASH

    NASA Astrophysics Data System (ADS)

    Ishimura, Kazuya

    2015-12-01

    A massively parallel program for quantum chemistry calculations SMASH was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C150H30)2 with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer.

  3. Development of massively parallel quantum chemistry program SMASH

    SciTech Connect

    Ishimura, Kazuya

    2015-12-31

    A massively parallel program for quantum chemistry calculations SMASH was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C{sub 150}H{sub 30}){sub 2} with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer.

  4. Web Based Parallel Programming Workshop for Undergraduate Education.

    ERIC Educational Resources Information Center

    Marcus, Robert L.; Robertson, Douglass

    Central State University (Ohio), under a contract with Nichols Research Corporation, has developed a World Wide web based workshop on high performance computing entitled "IBN SP2 Parallel Programming Workshop." The research is part of the DoD (Department of Defense) High Performance Computing Modernization Program. The research activities included…

  5. On program restructuring, scheduling, and communication for parallel processor systems

    SciTech Connect

    Polychronopoulos, Constantine D.

    1986-08-01

    This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler was used to transform programs in a parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented. 69 refs., 74 figs., 14 tabs.

  6. Exploiting loop level parallelism in nonprocedural dataflow programs

    NASA Technical Reports Server (NTRS)

    Gokhale, Maya B.

    1987-01-01

    Discussed are how loop level parallelism is detected in a nonprocedural dataflow program, and how a procedural program with concurrent loops is scheduled. Also discussed is a program restructuring technique which may be applied to recursive equations so that concurrent loops may be generated for a seemingly iterative computation. A compiler which generates C code for the language described below has been implemented. The scheduling component of the compiler and the restructuring transformation are described.

  7. Deadlock and fictitiousness problem in parallel program specifications

    SciTech Connect

    Panfilenko, V.P.

    1995-05-01

    One of the directions of modern programming based on algebraic methods takes its origin in V.M. Glushkov`s theory of systems of algorithmic algebras (SAA). The SAA apparatus with appropriately interpreted operations is used for program design and allows compact structured representation of program schemas in the form of algebraic formulas. Modified systems of algorithmic algebras (SAA-M) additionally represent parallelism description tools.

  8. Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.

    2000-01-01

    Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining

  9. Execution models for mapping programs onto distributed memory parallel computers

    NASA Technical Reports Server (NTRS)

    Sussman, Alan

    1992-01-01

    The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.

  10. Center for Programming Models for Scalable Parallel Computing

    SciTech Connect

    John Mellor-Crummey

    2008-02-29

    Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implemention of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf.

  11. Deriving sequential and parallel programs from pure Lisp specifications by program transformation

    SciTech Connect

    Boyle, J.M.; Dritz, K.W.; Muralidharan, M.N.; Taylor, R.J.

    1986-01-01

    We describe an ''industrial strength'' example of the use of specification and transformation to produce an efficient program. The specification is written in pure Lisp. Program transformations have been used to generate from it a program that executes efficiently on a wide variety of sequential machines. Another set of transformations has been used to generate programs from it that execute efficiently on global-memory parallel machines. We discuss some of the design decisions for the parallel version (and their motivations), optimization of parallel programs, and the benefits of the specification-transformation approach.

  12. Modelling parallel programs and multiprocessor architectures with AXE

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Fineman, Charles E.

    1991-01-01

    AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate for parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user-interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior is described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.

  13. Advanced parallel programming models research and development opportunities.

    SciTech Connect

    Wen, Zhaofang.; Brightwell, Ronald Brian

    2004-07-01

    There is currently a large research and development effort within the high-performance computing community on advanced parallel programming models. This research can potentially have an impact on parallel applications, system software, and computing architectures in the next several years. Given Sandia's expertise and unique perspective in these areas, particularly on very large-scale systems, there are many areas in which Sandia can contribute to this effort. This technical report provides a survey of past and present parallel programming model research projects and provides a detailed description of the Partitioned Global Address Space (PGAS) programming model. The PGAS model may offer several improvements over the traditional distributed memory message passing model, which is the dominant model currently being used at Sandia. This technical report discusses these potential benefits and outlines specific areas where Sandia's expertise could contribute to current research activities. In particular, we describe several projects in the areas of high-performance networking, operating systems and parallel runtime systems, compilers, application development, and performance evaluation.

  14. Testing New Programming Paradigms with NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

    2000-01-01

    Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developing to aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. Recent years new effort has been made in defining new parallel programming paradigms. The best examples are: HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplify the tedious tasks encountered in writing message passing programs. HPF is independent of memory hierarchy, however, due to the immaturity of compiler technology its performance is still questionable. Although use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promisses portability in a heterogeneous environment and offers possibility to "compile once and run anywhere." In light of testing these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java-threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage

  15. Performance Evaluation Methodologies and Tools for Massively Parallel Programs

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Sarukkai, Sekhar; Tucker, Deanne (Technical Monitor)

    1994-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. The recent introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CSI'S Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance tool technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g., AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2)) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.

  16. Methodologies and Tools for Tuning Parallel Programs: Facts and Fantasies

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Lum, Henry, Jr. (Technical Monitor)

    1994-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. The recent introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance tool technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g. AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.

  17. Final Report: Center for Programming Models for Scalable Parallel Computing

    SciTech Connect

    Mellor-Crummey, John

    2011-09-13

    As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.

  18. On the utility of threads for data parallel programming

    NASA Technical Reports Server (NTRS)

    Fahringer, Thomas; Haines, Matthew; Mehrotra, Piyush

    1995-01-01

    Threads provide a useful programming model for asynchronous behavior because of their ability to encapsulate units of work that can then be scheduled for execution at runtime, based on the dynamic state of a system. Recently, the threaded model has been applied to the domain of data parallel scientific codes, and initial reports indicate that the threaded model can produce performance gains over non-threaded approaches, primarily through the use of overlapping useful computation with communication latency. However, overlapping computation with communication is possible without the benefit of threads if the communication system supports asynchronous primitives, and this comparison has not been made in previous papers. This paper provides a critical look at the utility of lightweight threads as applied to data parallel scientific programming.

  19. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA AMES Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.

  20. An informal introduction to program transformation and parallel processors

    SciTech Connect

    Hopkins, K.W.

    1994-08-01

    In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers, but.certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the ``mathematical layman.`` I place myself in that category as well as my previous use of computers was as a classroom demonstration tool.

  1. Users manual for the Chameleon parallel programming tools

    SciTech Connect

    Gropp, W.; Smith, B.

    1993-06-01

    Message passing is a common method for writing programs for distributed-memory parallel computers. Unfortunately, the lack of a standard for message passing has hampered the construction of portable and efficient parallel programs. In an attempt to remedy this problem, a number of groups have developed their own message-passing systems, each with its own strengths and weaknesses. Chameleon is a second-generation system of this type. Rather than replacing these existing systems, Chameleon is meant to supplement them by providing a uniform way to access many of these systems. Chameleon`s goals are to (a) be very lightweight (low over-head), (b) be highly portable, and (c) help standardize program startup and the use of emerging message-passing operations such as collective operations on subsets of processors. Chameleon also provides a way to port programs written using PICL or Intel NX message passing to other systems, including collections of workstations. Chameleon is tracking the Message-Passing Interface (MPI) draft standard and will provide both an MPI implementation and an MPI transport layer. Chameleon provides support for heterogeneous computing by using p4 and PVM. Chameleon`s support for homogeneous computing includes the portable libraries p4, PICL, and PVM and vendor-specific implementation for Intel NX, IBM EUI (SP-1), and Thinking Machines CMMD (CM-5). Support for Ncube and PVM 3.x is also under development.

  2. Automated Performance Prediction of Message-Passing Parallel Programs

    NASA Technical Reports Server (NTRS)

    Block, Robert J.; Sarukkai, Sekhar; Mehra, Pankaj; Woodrow, Thomas S. (Technical Monitor)

    1995-01-01

    The increasing use of massively parallel supercomputers to solve large-scale scientific problems has generated a need for tools that can predict scalability trends of applications written for these machines. Much work has been done to create simple models that represent important characteristics of parallel programs, such as latency, network contention, and communication volume. But many of these methods still require substantial manual effort to represent an application in the model's format. The NIK toolkit described in this paper is the result of an on-going effort to automate the formation of analytic expressions of program execution time, with a minimum of programmer assistance. In this paper we demonstrate the feasibility of our approach, by extending previous work to detect and model communication patterns automatically, with and without overlapped computations. The predictions derived from these models agree, within reasonable limits, with execution times of programs measured on the Intel iPSC/860 and Paragon. Further, we demonstrate the use of MK in selecting optimal computational grain size and studying various scalability metrics.

  3. NavP: Structured and Multithreaded Distributed Parallel Programming

    NASA Technical Reports Server (NTRS)

    Pan, Lei

    2007-01-01

    We present Navigational Programming (NavP) -- a distributed parallel programming methodology based on the principles of migrating computations and multithreading. The four major steps of NavP are: (1) Distribute the data using the data communication pattern in a given algorithm; (2) Insert navigational commands for the computation to migrate and follow large-sized distributed data; (3) Cut the sequential migrating thread and construct a mobile pipeline; and (4) Loop back for refinement. NavP is significantly different from the current prevailing Message Passing (MP) approach. The advantages of NavP include: (1) NavP is structured distributed programming and it does not change the code structure of an original algorithm. This is in sharp contrast to MP as MP implementations in general do not resemble the original sequential code; (2) NavP implementations are always competitive with the best MPI implementations in terms of performance. Approaches such as DSM or HPF have failed to deliver satisfying performance as of today in contrast, even if they are relatively easy to use compared to MP; (3) NavP provides incremental parallelization, which is beyond the reach of MP; and (4) NavP is a unifying approach that allows us to exploit both fine- (multithreading on shared memory) and coarse- (pipelined tasks on distributed memory) grained parallelism. This is in contrast to the currently popular hybrid use of MP+OpenMP, which is known to be complex to use. We present experimental results that demonstrate the effectiveness of NavP.

  4. A scalable parallel algorithm for multiple objective linear programs

    NASA Technical Reports Server (NTRS)

    Wiecek, Malgorzata M.; Zhang, Hong

    1994-01-01

    This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives specially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.

  5. 78 FR 76628 - Pilot Program for Parallel Review of Medical Products; Extension of the Duration of the Program

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-12-18

    ... Parallel Review of Medical Products; Extension of the Duration of the Program AGENCIES: Food and Drug... extension of the ``Pilot Program for Parallel Review of Medical Products.'' ] The Agencies have decided to... the Agencies. In the October 11, 2011 (76 FR 62808), Parallel Review Pilot Program notice,...

  6. Parallelization and checkpointing of GPU applications through program transformation

    SciTech Connect

    Solano-Quinde, Lizandro Damian

    2012-01-01

    GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and

  7. An empirical study of Fortran programs for parallelizing compilers

    SciTech Connect

    Shen, Z. ); Li, Z. ); Yew, P.C. . Center for Supercomputing Research and Development)

    1990-07-01

    In this paper, the authors report some results from an empirical study of program characteristics that are important to parallelizing compiler writers, especially in the area of data dependence analysis and program transformations. The state of the art in data dependence analysis and some parallel execution techniques are also examined. The major findings include: many subscripts contain symbolic terms with unknown values. A few methods to determine their values at compile time are evaluated; array references with coupled subscripts appear quite frequently. These subscripts must be handled simultaneously in a dependence test, rather than being handled separately as in current test algorithms; nonzero coefficients of loop indexes in most subscripts are found to be simple: they are either 1 or -1. It allows an exact real-valued test to be as accurate as an exact integer-valued test for one-dimensional or two-dimensional arrays; dependences with uncertain distance are found to be rather common, and one of the main reasons is the frequent appearance of symbolic terms with unknown values.

  8. A Programming Model Performance Study Using the NAS Parallel Benchmarks

    DOE PAGES

    Shan, Hongzhang; Blagojević, Filip; Min, Seung-Jai; Hargrove, Paul; Jin, Haoqiang; Fuerlinger, Karl; Koniges, Alice; Wright, Nicholas J.

    2010-01-01

    Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS to understand their performance and memory usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other ways to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an Infiniband cluster. Our results show that in general the threemore » programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to re-write the UPC versions and achieve equal performance to OpenMP. Using OpenMP was also the most advantageous in terms of memory usage. Also we compare performance differences between the two Cray systems, which have quad-core and hex-core processors. We show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.« less

  9. An empirical study of FORTRAN programs for parallelizing compilers

    NASA Technical Reports Server (NTRS)

    Shen, Zhiyu; Li, Zhiyuan; Yew, Pen-Chung

    1990-01-01

    Some results are reported from an empirical study of program characteristics that are important in parallelizing compiler writers, especially in the area of data dependence analysis and program transformations. The state of the art in data dependence analysis and some parallel execution techniques are examined. The major findings are included. Many subscripts contain symbolic terms with unknown values. A few methods of determining their values at compile time are evaluated. Array references with coupled subscripts appear quite frequently; these subscripts must be handled simultaneously in a dependence test, rather than being handled separately as in current test algorithms. Nonzero coefficients of loop indexes in most subscripts are found to be simple: they are either 1 or -1. This allows an exact real-valued test to be as accurate as an exact integer-valued test for one-dimensional or two-dimensional arrays. Dependencies with uncertain distance are found to be rather common, and one of the main reasons is the frequent appearance of symbolic terms with unknown values.

  10. Center for Programming Models for Scalable Parallel Computing: Future Programming Models

    SciTech Connect

    Gao, Guang, R.

    2008-07-24

    The mission of the pmodel center project is to develop software technology to support scalable parallel programming models for terascale systems. The goal of the specific UD subproject is in the context developing an efficient and robust methodology and tools for HPC programming. More specifically, the focus is on developing new programming models which facilitate programmers in porting their application onto parallel high performance computing systems. During the course of the research in the past 5 years, the landscape of microprocessor chip architecture has witnessed a fundamental change – the emergence of multi-core/many-core chip architecture appear to become the mainstream technology and will have a major impact to for future generation parallel machines. The programming model for shared-address space machines is becoming critical to such multi-core architectures. Our research highlight is the in-depth study of proposed fine-grain parallelism/multithreading support on such future generation multi-core architectures. Our research has demonstrated the significant impact such fine-grain multithreading model can have on the productivity of parallel programming models and their efficient implementation.

  11. A Parallel Vector Machine for the PM Programming Language

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2016-04-01

    PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. Inter-node communication is accomplished using

  12. Parallel functional programming in Sisal: Fictions, facts, and future

    SciTech Connect

    McGraw, J.R.

    1993-07-01

    This paper provides a status report on the progress of research and development on the functional language Sisal. This project focuses on providing a highly effective method of writing large scientific applications that can efficiently execute on a spectrum of different multiprocessors. The paper includes sections on the language definition, compilation strategies, and programming techniques intended for readers with little or no background with Sisal. The section on performance presents our most recent results on execution speed for shared-memory multiprocessors, our findings using Sisal to develop codes, and our experiences migrating the same source code to different machines. For large programs, the execution performance of Sisal (with minimal supporting advice from the programmer) usually exceeds that of the best available automatic, vector/parallel Fortran compilers. Our evidence also indicates that Sisal programs tend to be shorter in length, faster to write, and dearer to understand than equivalent algorithms in Fortran. The paper concludes with a substantial discussion of common criticisms of the language and our plans for addressing them. Most notably, efficient implementations for distributed memory machines are lacking; an issue we plan to remedy.

  13. Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Sarukkai, Sekhar R.; Mehra, Pankaj; Lum, Henry, Jr. (Technical Monitor)

    1994-01-01

    This paper presents a methodology for debugging the performance of message-passing programs on both tightly coupled and loosely coupled distributed-memory machines. The AIMS (Automated Instrumentation and Monitoring System) toolkit, a suite of software tools for measurement and analysis of performance, is introduced and its application illustrated using several benchmark programs drawn from the field of computational fluid dynamics. AIMS includes (i) Xinstrument, a powerful source-code instrumentor, which supports both Fortran77 and C as well as a number of different message-passing libraries including Intel's NX Thinking Machines' CMMD, and PVM; (ii) Monitor, a library of timestamping and trace -collection routines that run on supercomputers (such as Intel's iPSC/860, Delta, and Paragon and Thinking Machines' CM5) as well as on networks of workstations (including Convex Cluster and SparcStations connected by a LAN); (iii) Visualization Kernel, a trace-animation facility that supports source-code clickback, simultaneous visualization of computation and communication patterns, as well as analysis of data movements; (iv) Statistics Kernel, an advanced profiling facility, that associates a variety of performance data with various syntactic components of a parallel program; (v) Index Kernel, a diagnostic tool that helps pinpoint performance bottlenecks through the use of abstract indices; (vi) Modeling Kernel, a facility for automated modeling of message-passing programs that supports both simulation -based and analytical approaches to performance prediction and scalability analysis; (vii) Intrusion Compensator, a utility for recovering true performance from observed performance by removing the overheads of monitoring and their effects on the communication pattern of the program; and (viii) Compatibility Tools, that convert AIMS-generated traces into formats used by other performance-visualization tools, such as ParaGraph, Pablo, and certain AVS/Explorer modules.

  14. 76 FR 62808 - Pilot Program for Parallel Review of Medical Products

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-10-11

    ... Parallel Review of Medical Products AGENCY: Food and Drug Administration, Centers for Medicare and Medicaid..., Federal Register notice (75 FR 57045), parallel review is intended to reduce the time between FDA... regulated medical products. We also stated our intention to initiate a pilot program for parallel review...

  15. Rochester checkers player: Multi-model parallel programming for animate vision. Technical report

    SciTech Connect

    Marsh, B.D.; Brown, C.M.; LeBlanc, T.J.; Scott, M.L.; Becker, T.G.

    1991-06-01

    Animate vision systems couple computer vision and robotics to achieve robust and accurate vision, as well as other complex behavior. These systems combine low-level sensory processing and effector output with high-level cognitive planning - all computationally intensive tasks that can benefit from parallel processing. No single model of parallel programming is likely to serve for all tasks, however. Early vision algorithms are intensely data parallel, often utilizing fine-grain parallel computations that share an image, while cognition algorithms decompose naturally by function, often consisting of loosely-coupled, coarse-grain parallel units. A typical animate vision application will likely consist of many tasks, each of which may require a different parallel programming model, and all of which must cooperate to achieve the desired behavior. These multi-model programs require an underlying software system that not only supports several different models of parallel computation simultaneously, but which also allows tasks implemented in different models to interact.

  16. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems.

    PubMed

    Stone, John E; Gohara, David; Shi, Guochun

    2010-05-01

    We provide an overview of the key architectural features of recent microprocessor designs and describe the programming model and abstractions provided by OpenCL, a new parallel programming standard targeting these architectures.

  17. 76 FR 66309 - Pilot Program for Parallel Review of Medical Products; Correction

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-10-26

    ... Parallel Review of Medical Products; Correction AGENCY: Food and Drug Administration, Centers for Medicare... Federal Register of October 11, 2011 (76 FR 62808). The document announced a pilot program for sponsors of innovative device technologies to participate in a program of parallel FDA-CMS review. The document...

  18. Programming with a high degree of parallelism in fortran

    NASA Astrophysics Data System (ADS)

    Jesshope, C. R.

    1982-06-01

    Many parallel extensions to FORTRAN have been proposed by 'supercomputer' manufacturers. The major differences between these language extensions is reviewed briefly. The Principle of Conservation of Parallelism is also introduced, which is argued to be a desirable foundation on which to base the development of code for parallel computers. Simply stated it requires that the degree of parallelism should not increase during the translation of an algorithm from a concept to a high level language (FORTRAN say) and finally into the machine code of the target computer. Cray FORTRAN and other vectorising compilers do not adhere to this principle, as the parallelism increases from 1 to some greater degree during the compilation process. A simple example will be used to illustrate the implications of this principle, which shows that it will reduce operations at the expense of storage locations. Vectorising compilers may reduce this storage requirement but will increase the number of operations. Two further examples of highly parallel and practical codes are also presented. These illustrate the compactness of code and the close relationship between the mathematical description of the problem and the FORTRAN implementation. The examples show the matrix multiplication and fast Fourier transform algorithms.

  19. Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

    NASA Technical Reports Server (NTRS)

    Quealy, Angela; Cole, Gary L.; Blech, Richard A.

    1993-01-01

    The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.

  20. Parallel logic programming and parallel systems software and hardware. Progress report (Final), 1 April 1988-31 March 1989

    SciTech Connect

    Minker, J.

    1989-07-29

    This progress report summarizes work performed under AFOSR-88-0152 on parallel logic programming, problem solving, and deductive data bases. A parallel problem-solving system, PRISM (Parallel Inference System), that was implemented on McMOB was ported to the BBN Butterfly machine. Two versions of PRISM were developed and are operational on the Butterfly: a message-passing ring-structure system and a shared-memory system. Experimental testing of PRISM on McMOB continued, while experiments were also conducted on the Butterfly systems. Three enhancements were made and completed during the grant period. These are: a capability to handle negated queries and a capability to assert and retract statements. In addition to the above, work continued in the area of informative answers to queries in deductive data bases. A thesis was completed on the subject. An interpreter was developed and is running, that can take restricted natural language as input and can respond with a cooperative natural language output. In the area of parallel software development, the following were accomplished. Theoretical work on slicing/splicing was completed. Tools were provided for software development using artificial-intelligence techniques. AI software for massively parallel architectures was started.

  1. Retargeting of existing FORTRAN program and development of parallel compilers

    NASA Technical Reports Server (NTRS)

    Agrawal, Dharma P.

    1988-01-01

    The software models used in implementing the parallelizing compiler for the B-HIVE multiprocessor system are described. The various models and strategies used in the compiler development are: flexible granularity model, which allows a compromise between two extreme granularity models; communication model, which is capable of precisely describing the interprocessor communication timings and patterns; loop type detection strategy, which identifies different types of loops; critical path with coloring scheme, which is a versatile scheduling strategy for any multicomputer with some associated communication costs; and loop allocation strategy, which realizes optimum overlapped operations between computation and communication of the system. Using these models, several sample routines of the AIR3D package are examined and tested. It may be noted that automatically generated codes are highly parallelized to provide the maximized degree of parallelism, obtaining the speedup up to a 28 to 32-processor system. A comparison of parallel codes for both the existing and proposed communication model, is performed and the corresponding expected speedup factors are obtained. The experimentation shows that the B-HIVE compiler produces more efficient codes than existing techniques. Work is progressing well in completing the final phase of the compiler. Numerous enhancements are needed to improve the capabilities of the parallelizing compiler.

  2. How to write fast and clear parallel programs using algebra

    SciTech Connect

    Stiller, L. Johns Hopkins Univ., Baltimore, MD )

    1992-01-01

    An algebraic method for the design of efficient and easy to port codes for parallel machines is described. The method was applied to speed up and to clarify certain communication functions, n-body codes, a biomolecular analysis, and a chess problem.

  3. How to write fast and clear parallel programs using algebra

    SciTech Connect

    Stiller, L. |

    1992-10-01

    An algebraic method for the design of efficient and easy to port codes for parallel machines is described. The method was applied to speed up and to clarify certain communication functions, n-body codes, a biomolecular analysis, and a chess problem.

  4. PRAM C:a new programming environment for fine-grain and coarse-grain parallelism.

    SciTech Connect

    Brown, Jonathan Leighton; Wen, Zhaofang.

    2004-11-01

    In the search for ''good'' parallel programming environments for Sandia's current and future parallel architectures, they revisit a long-standing open question. Can the PRAM parallel algorithms designed by theoretical computer scientists over the last two decades be implemented efficiently? This open question has co-existed with ongoing efforts in the HPC community to develop practical parallel programming models that can simultaneously provide ease of use, expressiveness, performance, and scalability. Unfortunately, no single model has met all these competing requirements. Here they propose a parallel programming environment, PRAM C, to bridge the gap between theory and practice. This is an attempt to provide an affirmative answer to the PRAM question, and to satisfy these competing practical requirements. This environment consists of a new thin runtime layer and an ANSI C extension. The C extension has two control constructs and one additional data type concept, ''shared''. This C extension should enable easy translation from PRAM algorithms to real parallel programs, much like the translation from sequential algorithms to C programs. The thin runtime layer bundles fine-grained communication requests into coarse-grained communication to be served by message-passing. Although the PRAM represents SIMD-style fine-grained parallelism, a stand-alone PRAM C environment can support both fine-grained and coarse-grained parallel programming in either a MIMD or SPMD style, interoperate with existing MPI libraries, and use existing hardware. The PRAM C model can also be integrated easily with existing models. Unlike related efforts proposing innovative hardware with the goal to realize the PRAM, ours can be a pure software solution with the purpose to provide a practical programming environment for existing parallel machines; it also has the potential to perform well on future parallel architectures.

  5. Programming environment for parallel-vision algorithms. Final technical report, February 1988-December 1989

    SciTech Connect

    Brown, C.

    1990-04-11

    This contract developed and disseminated papers, ideas, algorithms, analysis, software, applications, and implementations for parallel programming environments for computer vision and for vision applications. The work has been widely reported and highly influential. The most significant work centered on the Butterfly Parallel Processor, the MaxVideo pipelined parallel image processor, and the development of the real-time computer vision laboratory. For the Butterfly, the Psyche multi-model operating system was developed and the CONSUL autoparallelizing compiler was designed. Much basic and influential performance monitoring and debugging work was completed, resulting in working systems and novel algorithms. There was also significant research in systems and applications using other parallel architectures in the laboratory, such as the MaxVideo parallel pipelined image processor. The contract developed a heterogeneous parallel architecture involving pipelined and MIMD parallelism and integrated it with a robot head.

  6. Object-Oriented NeuroSys: Parallel Programs for Simulating Large Networks of Biologically Accurate Neurons

    SciTech Connect

    Pacheco, P; Miller, P; Kim, J; Leese, T; Zabiyaka, Y

    2003-05-07

    Object-oriented NeuroSys (ooNeuroSys) is a collection of programs for simulating very large networks of biologically accurate neurons on distributed memory parallel computers. It includes two principle programs: ooNeuroSys, a parallel program for solving the large systems of ordinary differential equations arising from the interconnected neurons, and Neurondiz, a parallel program for visualizing the results of ooNeuroSys. Both programs are designed to be run on clusters and use the MPI library to obtain parallelism. ooNeuroSys also includes an easy-to-use Python interface. This interface allows neuroscientists to quickly develop and test complex neuron models. Both ooNeuroSys and Neurondiz have a design that allows for both high performance and relative ease of maintenance.

  7. Communications oriented programming of parallel iterative solutions of sparse linear systems

    NASA Technical Reports Server (NTRS)

    Patrick, M. L.; Pratt, T. W.

    1986-01-01

    Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.

  8. Adapting high-level language programs for parallel processing using data flow

    NASA Technical Reports Server (NTRS)

    Standley, Hilda M.

    1988-01-01

    EASY-FLOW, a very high-level data flow language, is introduced for the purpose of adapting programs written in a conventional high-level language to a parallel environment. The level of parallelism provided is of the large-grained variety in which parallel activities take place between subprograms or processes. A program written in EASY-FLOW is a set of subprogram calls as units, structured by iteration, branching, and distribution constructs. A data flow graph may be deduced from an EASY-FLOW program.

  9. Buffered coscheduling for parallel programming and enhanced fault tolerance

    DOEpatents

    Petrini, Fabrizio; Feng, Wu-chun

    2006-01-31

    A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval is accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors

  10. A data-driven parallel execution model and architecture for logic programs

    SciTech Connect

    Tseng, Chien-Chao.

    1989-01-01

    Logic Programming has come to prominence in recent years after the decision of the Japanese Fifth Generation Project to adopt it as the kernel language. A significant number of research projects are attempting to implement different schemes to exploit the inherent parallelism in logic programs. Data flow architectural model has been found to attractive for parallel execution of logic programs. In this research, five dataflow execution models available in literature, have been critically reviewed. The primary aim of the critical review was to establish a set of design issues critical to efficient execution. Based on the established design issues, the abstract date - driven machine model, names LogDf, is developed for parallel execution of logic programs. The execution scheme supports OR - parallelism, Restricted AND parallelism and stream parallelism. Multiple binding environments are represented using stream of streams structure (S-stream). Eager evaluation is performed by passing binding environment between subgoal literals as S-streams, which are formed using non-strict constructors. The hierarchical multi-level stream structure provides a logical framework for distributing the streams to enhance parallelism in production/consumption as well as control of parallelism. The scheme for compiling the dataflow graphs, developed in this thesis, eliminates the necessity of any operand matching unit in the underlying dynamic dataflow architecture. In this thesis, an architecture for the abstract machine LogDf is also provided and the performance evaluation of this model is based on this architecture.

  11. A practical algorithm for static analysis of parallel programs

    SciTech Connect

    McDowell, C.E. )

    1989-06-01

    One approach to analyzing the behavior of a concurrent program requires determining the reachable program states. A program state consists of a set of task states, the values of shared variables used for synchronization, and local variables that derive the values directly from synchronization operations. However, the number of reachable states rises exponentially with the number of tasks and becomes intractable for many concurrent programs. A variation of this approach merges a set of related states into a single virtual state. Using this approach, the analysis of concurrent programs becomes feasible as the number of virtual states is often orders of magnitude less than the number of reachable states. This paper presents a method for determining the virtual states that describe the reachable program states, and the reduction in the number of states is analyzed. The algorithms given have been implemented in a state program analyzer for multitasking Fortran, and the results obtained are discussed.

  12. Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters

    NASA Astrophysics Data System (ADS)

    Yang, Chao-Tung; Huang, Chih-Lin; Lin, Cheng-Fang

    2011-01-01

    Nowadays, NVIDIA's CUDA is a general purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions - a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA OpenMP, and MPI programming, which partition loop iterations according to the number of C1060 GPU nodes in a GPU cluster which consists of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA run by the processor cores in the same computational node.

  13. F-Nets and Software Cabling: Deriving a Formal Model and Language for Portable Parallel Programming

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    Parallel programming is still being based upon antiquated sequence-based definitions of the terms "algorithm" and "computation", resulting in programs which are architecture dependent and difficult to design and analyze. By focusing on obstacles inherent in existing practice, a more portable model is derived here, which is then formalized into a model called Soviets which utilizes a combination of imperative and functional styles. This formalization suggests more general notions of algorithm and computation, as well as insights into the meaning of structured programming in a parallel setting. To illustrate how these principles can be applied, a very-high-level graphical architecture-independent parallel language, called Software Cabling, is described, with many of the features normally expected from today's computer languages (e.g. data abstraction, data parallelism, and object-based programming constructs).

  14. Resolutions of the Coulomb operator: VIII. Parallel implementation using the modern programming language X10.

    PubMed

    Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P

    2014-10-30

    Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner including use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of integral calculation using X10's work stealing runtime, and report performance results for long-range HF energy calculation of large molecule/high quality basis running on up to 1024 cores of a high performance cluster machine.

  15. Backtracking and Re-execution in the Automatic Debugging of Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Matthews, Gregory; Hood, Robert; Johnson, Stephen; Leggett, Peter; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this work we describe a new approach using relative debugging to find differences in computation between a serial program and a parallel version of th it program. We use a combination of re-execution and backtracking in order to find the first difference in computation that may ultimately lead to an incorrect value that the user has indicated. In our prototype implementation we use static analysis information from a parallelization tool in order to perform the backtracking as well as the mapping required between serial and parallel computations.

  16. Multiprocessor speed-up, Amdahl's Law, and the Activity Set Model of parallel program behavior

    NASA Technical Reports Server (NTRS)

    Gelenbe, Erol

    1988-01-01

    An important issue in the effective use of parallel processing is the estimation of the speed-up one may expect as a function of the number of processors used. Amdahl's Law has traditionally provided a guideline to this issue, although it appears excessively pessimistic in the light of recent experimental results. In this note, Amdahl's Law is amended by giving a greater importance to the capacity of a program to make effective use of parallel processing, but also recognizing the fact that imbalance of the workload of each processor is bound to occur. An activity set model of parallel program behavior is then introduced along with the corresponding parallelism index of a program, leading to upper and lower bounds to the speed-up.

  17. Concurrent extensions to the FORTRAN language for parallel programming of computational fluid dynamics algorithms

    NASA Technical Reports Server (NTRS)

    Weeks, Cindy Lou

    1986-01-01

    Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.

  18. Programming language supporting first-class parallel environments. Technical report

    SciTech Connect

    Jagannathan, S.

    1989-01-01

    This thesis presents a new programming model called the symmetric model in which the representation of programs is identical to the representation of data: to specify a computation, one defines a data structure. This data structure possesses the semantics of a first-class naming environment-it defines a scope and can be used to affect the evaluation environment of other expressions. A new programming language is presented called Symmetric Lisp, based on the symmetric model. Program structures in Symmetric Lisp are considered non-strict: a program's components may be examined even as its other elements continue to evaluate. The first part of the thesis investigates the interaction of non-strictness with first-class naming environments. The second part of the thesis discusses the compilation and implementation of Symmetric Lisp. An extended type inference system is presented for first-class environments that can be used to infer the proper evaluation environment of identifiers found within the scope of environment-yielding expressions. A translation of Symmetric Lisp into a high-level data flow language is presented.

  19. Using CLIPS in the domain of knowledge-based massively parallel programming

    NASA Technical Reports Server (NTRS)

    Dvorak, Jiri J.

    1994-01-01

    The Program Development Environment (PDE) is a tool for massively parallel programming of distributed-memory architectures. Adopting a knowledge-based approach, the PDE eliminates the complexity introduced by parallel hardware with distributed memory and offers complete transparency in respect of parallelism exploitation. The knowledge-based part of the PDE is realized in CLIPS. Its principal task is to find an efficient parallel realization of the application specified by the user in a comfortable, abstract, domain-oriented formalism. A large collection of fine-grain parallel algorithmic skeletons, represented as COOL objects in a tree hierarchy, contains the algorithmic knowledge. A hybrid knowledge base with rule modules and procedural parts, encoding expertise about application domain, parallel programming, software engineering, and parallel hardware, enables a high degree of automation in the software development process. In this paper, important aspects of the implementation of the PDE using CLIPS and COOL are shown, including the embedding of CLIPS with C++-based parts of the PDE. The appropriateness of the chosen approach and of the CLIPS language for knowledge-based software engineering are discussed.

  20. Programming a massively parallel, computation universal system: static behavior

    SciTech Connect

    Lapedes, A.; Farber, R.

    1986-01-01

    In previous work by the authors, the ''optimum finding'' properties of Hopfield neural nets were applied to the nets themselves to create a ''neural compiler.'' This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). In this series of papers that approach is extended to programming nets that contain interneurons (sometimes called ''hidden neurons''), and thus deals with nets capable of universal computation. 22 refs.

  1. Parallel programming of saccades during natural scene viewing: evidence from eye movement positions.

    PubMed

    Wu, Esther X W; Gilani, Syed Omer; van Boxtel, Jeroen J A; Amihai, Ido; Chua, Fook Kee; Yen, Shih-Cheng

    2013-10-24

    Previous studies have shown that saccade plans during natural scene viewing can be programmed in parallel. This evidence comes mainly from temporal indicators, i.e., fixation durations and latencies. In the current study, we asked whether eye movement positions recorded during scene viewing also reflect parallel programming of saccades. As participants viewed scenes in preparation for a memory task, their inspection of the scene was suddenly disrupted by a transition to another scene. We examined whether saccades after the transition were invariably directed immediately toward the center or were contingent on saccade onset times relative to the transition. The results, which showed a dissociation in eye movement behavior between two groups of saccades after the scene transition, supported the parallel programming account. Saccades with relatively long onset times (>100 ms) after the transition were directed immediately toward the center of the scene, probably to restart scene exploration. Saccades with short onset times (<100 ms) moved to the center only one saccade later. Our data on eye movement positions provide novel evidence of parallel programming of saccades during scene viewing. Additionally, results from the analyses of intersaccadic intervals were also consistent with the parallel programming hypothesis.

  2. Method for resource control in parallel environments using program organization and run-time support

    NASA Technical Reports Server (NTRS)

    Ekanadham, Kattamuri (Inventor); Moreira, Jose Eduardo (Inventor); Naik, Vijay Krishnarao (Inventor)

    2001-01-01

    A system and method for dynamic scheduling and allocation of resources to parallel applications during the course of their execution. By establishing well-defined interactions between an executing job and the parallel system, the system and method support dynamic reconfiguration of processor partitions, dynamic distribution and redistribution of data, communication among cooperating applications, and various other monitoring actions. The interactions occur only at specific points in the execution of the program where the aforementioned operations can be performed efficiently.

  3. Method for resource control in parallel environments using program organization and run-time support

    NASA Technical Reports Server (NTRS)

    Ekanadham, Kattamuri (Inventor); Moreira, Jose Eduardo (Inventor); Naik, Vijay Krishnarao (Inventor)

    1999-01-01

    A system and method for dynamic scheduling and allocation of resources to parallel applications during the course of their execution. By establishing well-defined interactions between an executing job and the parallel system, the system and method support dynamic reconfiguration of processor partitions, dynamic distribution and redistribution of data, communication among cooperating applications, and various other monitoring actions. The interactions occur only at specific points in the execution of the program where the aforementioned operations can be performed efficiently.

  4. Describing, using 'recognition cones'. [parallel-series model with English-like computer program

    NASA Technical Reports Server (NTRS)

    Uhr, L.

    1973-01-01

    A parallel-serial 'recognition cone' model is examined, taking into account the model's ability to describe scenes of objects. An actual program is presented in an English-like language. The concept of a 'description' is discussed together with possible types of descriptive information. Questions regarding the level and the variety of detail are considered along with approaches for improving the serial representations of parallel systems.

  5. Programming environment for parallel-vision algorithms. Annual report, February 1985-February 1986

    SciTech Connect

    Brown

    1986-08-01

    During the first year of the award period, three main lines of work were pursued: systems support algorithms, Butterfly programming environment, and vision applications. Today's multiprocessor computer architectures are not efficiently programmed or even conceptualized with standard computer languages, and their operating systems and debugging tools are also challengingly different. The University of Rochester is doing work in the area of tools for controlling large-grain parallelism, as one finds in a distributed multiprocessor application like the Autonomous Land Vehicle, or in tightly coupled processors like the Hypercube or the Butterfly Parallel Processor.

  6. Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA based implementations of selected CFD application benchmark kernels and compare them to corresponding message passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames, the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.

  7. Architecture-Adaptive Computing Environment: A Tool for Teaching Parallel Programming

    NASA Technical Reports Server (NTRS)

    Dorband, John E.; Aburdene, Maurice F.

    2002-01-01

    Recently, networked and cluster computation have become very popular. This paper is an introduction to a new C based parallel language for architecture-adaptive programming, aCe C. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should be an ideal tool to teach parallel programming. In this paper, we will focus on some fundamental features of aCe C.

  8. Concurrent Collections (CnC): A new approach to parallel programming

    SciTech Connect

    2010-05-07

    A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer or program) who demands maximal flexibility for optimizing the mapping to a wide range of target platforms (parallel / serial, shared / distributed, homogeneous / heterogeneous, etc.) but needs no knowledge of the domain. Concurrent Collections (CnC) is based on this separation of concerns. The talk will present CnC and its benefits. About the speaker Kathleen Knobe has focused throughout her career on parallelism especially compiler technology, runtime system design and language design. She worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms for Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipment’s Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired and absorbed by Intel. She currently works in the Software and

  9. LDRD final report on massively-parallel linear programming : the parPCx system.

    SciTech Connect

    Parekh, Ojas; Phillips, Cynthia Ann; Boman, Erik Gunnar

    2005-02-01

    This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runs on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Interger and Combinatorial Optimizer). We

  10. Method, systems, and computer program products for implementing function-parallel network firewall

    DOEpatents

    Fulp, Errin W.; Farley, Ryan J.

    2011-10-11

    Methods, systems, and computer program products for providing function-parallel firewalls are disclosed. According to one aspect, a function-parallel firewall includes a first firewall node for filtering received packets using a first portion of a rule set including a plurality of rules. The first portion includes less than all of the rules in the rule set. At least one second firewall node filters packets using a second portion of the rule set. The second portion includes at least one rule in the rule set that is not present in the first portion. The first and second portions together include all of the rules in the rule set.

  11. Speedup properties of phases in the execution profile of distributed parallel programs

    SciTech Connect

    Carlson, B.M.; Wagner, T.D.; Dowdy, L.W.; Worley, P.H.

    1992-08-01

    The execution profile of a distributed-memory parallel program specifies the number of busy processors as a function of time. Periods of homogeneous processor utilization are manifested in many execution profiles. These periods can usually be correlated with the algorithms implemented in the underlying parallel code. Three families of methods for smoothing execution profile data are presented. These approaches simplify the problem of detecting end points of periods of homogeneous utilization. These periods, called phases, are then examined in isolation, and their speedup characteristics are explored. A specific workload executed on an Intel iPSC/860 is used for validation of the techniques described.

  12. Programming environment for parallel vision algorithms. Annual report, February 1986-February 1987

    SciTech Connect

    Brown, C.

    1987-02-01

    During the second year of the award period, the Computer Science Department of the University of Rochester continued work in: 1) systems support algorithms, 2) the Butterfly programming environment, and 3) vision applications. This research produced several internal and external reports as well as much exportable code. The University of Rochester also employed DARPA Parallel Architecture Benchmark problems to test different algorithms using four different Butterfly programming environments. These tests produced several interesting results and demonstrated that the Butterfly architecture is a flexible general-purpose architecture that can be effectively programmed by non-experts, using tools developed at BBN and Rochester. The University of Rochester is continuing to study the issues and concerns surrounding the effective implementation of parallel algorithms.

  13. PINCA: A scalable parallel program for compressible gas dynamics with nonequilibrium chemistry

    SciTech Connect

    Wong, C.C.; Blottner, F.G.; Payne, J.L.; Soetrisno, M.; Imlay, S.T.

    1995-04-01

    This report documents an exploratory research work, funded by the Laboratory Directed Research and Development (LDRD) office at Sandia National Laboratories, to develop an advanced, general purpose, robust compressible flow solver for handling large, complex, chemically reacting gas dynamics problems. The deliverable of this project, a computer program called PINCA (Parallel INtegrated Computer Analysis) will run on massively parallel computers such as the Intel/Gamma and Intel/Paragon. With the development of this parallel compressible flow solver, engineers will be better able to address large three-dimensional scientific arid engineering problems involving multi-component gas mixtures with finite rate chemistry. These problems occur in high temperature industrial processes, combustion, and hypersonic: reentry of space-crafts.

  14. Methodologies and Tools for Tuning Parallel Programs: 80% Art, 20% Science, and 10% Luck

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Bailey, David (Technical Monitor)

    1996-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. In the past few years, the ubiquitous introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance instrumentation/monitor/tuning technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g. AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.

  15. Enabling Requirements-Based Programming for Highly-Dependable Complex Parallel and Distributed Systems

    NASA Technical Reports Server (NTRS)

    Hinchey, Michael G.; Rash, James L.; Rouff, Christopher A.

    2005-01-01

    The manual application of formal methods in system specification has produced successes, but in the end, despite any claims and assertions by practitioners, there is no provable relationship between a manually derived system specification or formal model and the customer's original requirements. Complex parallel and distributed system present the worst case implications for today s dearth of viable approaches for achieving system dependability. No avenue other than formal methods constitutes a serious contender for resolving the problem, and so recognition of requirements-based programming has come at a critical juncture. We describe a new, NASA-developed automated requirement-based programming method that can be applied to certain classes of systems, including complex parallel and distributed systems, to achieve a high degree of dependability.

  16. Parlin, a general microcomputer program for parallel-line analysis of bioassays.

    PubMed

    Jesty, J; Godfrey, H P

    1986-04-01

    Commonly used manual and calculator methods for analysis of clinically important parallel-line bioassays are subject to operator bias and provide neither confidence limits for the results nor any indication of their validity. To remedy this, the authors have written a general program for statistical analysis of these bioassays for the IBM Personal Computer and its compatibles. The program has been used for analysis of bioassays for specific coagulation factors and inflammatory lymphokines and for radioimmunoassays for prostaglandins. The program offers a choice of no transform, logarithmic, or logit transformation of data, which are fitted to parallel lines for standard and unknown. It analyzes the fit for parallelism and linearity with an F test, and calculates the best estimate of the result and its 95% confidence limits. Comparison of results calculated by PARLIN with those previously obtained manually shows excellent correlation (r greater than 0.99). Results obtained using PARLIN are quickly available with current assay technics and provide a complete evaluation of the bioassay at no increase in cost. PMID:3456698

  17. Dynamic programming in parallel boundary detection with application to ultrasound intima-media segmentation.

    PubMed

    Zhou, Yuan; Cheng, Xinyao; Xu, Xiangyang; Song, Enmin

    2013-12-01

    Segmentation of carotid artery intima-media in longitudinal ultrasound images for measuring its thickness to predict cardiovascular diseases can be simplified as detecting two nearly parallel boundaries within a certain distance range, when plaque with irregular shapes is not considered. In this paper, we improve the implementation of two dynamic programming (DP) based approaches to parallel boundary detection, dual dynamic programming (DDP) and piecewise linear dual dynamic programming (PL-DDP). Then, a novel DP based approach, dual line detection (DLD), which translates the original 2-D curve position to a 4-D parameter space representing two line segments in a local image segment, is proposed to solve the problem while maintaining efficiency and rotation invariance. To apply the DLD to ultrasound intima-media segmentation, it is imbedded in a framework that employs an edge map obtained from multiplication of the responses of two edge detectors with different scales and a coupled snake model that simultaneously deforms the two contours for maintaining parallelism. The experimental results on synthetic images and carotid arteries of clinical ultrasound images indicate improved performance of the proposed DLD compared to DDP and PL-DDP, with respect to accuracy and efficiency.

  18. MIST: An Open Source Environmental Modelling Programming Language Incorporating Easy to Use Data Parallelism.

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2014-05-01

    Model Integration System (MIST) is open-source environmental modelling programming language that directly incorporates data parallelism. The language is designed to enable straightforward programming structures, such as nested loops and conditional statements to be directly translated into sequences of whole-array (or more generally whole data-structure) operations. MIST thus enables the programmer to use well-understood constructs, directly relating to the mathematical structure of the model, without having to explicitly vectorize code or worry about details of parallelization. A range of common modelling operations are supported by dedicated language structures operating on cell neighbourhoods rather than individual cells (e.g.: the 3x3 local neighbourhood needed to implement an averaging image filter can be simply accessed from within a simple loop traversing all image pixels). This facility hides details of inter-process communication behind more mathematically relevant descriptions of model dynamics. The MIST automatic vectorization/parallelization process serves both to distribute work among available nodes and separately to control storage requirements for intermediate expressions - enabling operations on very large domains for which memory availability may be an issue. MIST is designed to facilitate efficient interpreter based implementations. A prototype open source interpreter is available, coded in standard FORTRAN 95, with tools to rapidly integrate existing FORTRAN 77 or 95 code libraries. The language is formally specified and thus not limited to FORTRAN implementation or to an interpreter-based approach. A MIST to FORTRAN compiler is under development and volunteers are sought to create an ANSI-C implementation. Parallel processing is currently implemented using OpenMP. However, parallelization code is fully modularised and could be replaced with implementations using other libraries. GPU implementation is potentially possible.

  19. Support of Multidimensional Parallelism in the OpenMP Programming Model

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele

    2003-01-01

    OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often effect the scalability of applications. Examples for these limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test with benchmark codes and a cloud modeling code.

  20. Development, Verification and Validation of Parallel, Scalable Volume of Fluid CFD Program for Propulsion Applications

    NASA Technical Reports Server (NTRS)

    West, Jeff; Yang, H. Q.

    2014-01-01

    There are many instances involving liquid/gas interfaces and their dynamics in the design of liquid engine powered rockets such as the Space Launch System (SLS). Some examples of these applications are: Propellant tank draining and slosh, subcritical condition injector analysis for gas generators, preburners and thrust chambers, water deluge mitigation for launch induced environments and even solid rocket motor liquid slag dynamics. Commercially available CFD programs simulating gas/liquid interfaces using the Volume of Fluid approach are currently limited in their parallel scalability. In 2010 for instance, an internal NASA/MSFC review of three commercial tools revealed that parallel scalability was seriously compromised at 8 cpus and no additional speedup was possible after 32 cpus. Other non-interface CFD applications at the time were demonstrating useful parallel scalability up to 4,096 processors or more. Based on this review, NASA/MSFC initiated an effort to implement a Volume of Fluid implementation within the unstructured mesh, pressure-based algorithm CFD program, Loci-STREAM. After verification was achieved by comparing results to the commercial CFD program CFD-Ace+, and validation by direct comparison with data, Loci-STREAM-VoF is now the production CFD tool for propellant slosh force and slosh damping rate simulations at NASA/MSFC. On these applications, good parallel scalability has been demonstrated for problems sizes of tens of millions of cells and thousands of cpu cores. Ongoing efforts are focused on the application of Loci-STREAM-VoF to predict the transient flow patterns of water on the SLS Mobile Launch Platform in order to support the phasing of water for launch environment mitigation so that vehicle determinantal effects are not realized.

  1. The FORCE: A portable parallel programming language supporting computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Brehm, Juergen; Ramanan, Aruna

    1989-01-01

    This project supports the conversion of codes in Computational Structural Mechanics (CSM) to a parallel form which will efficiently exploit the computational power available from multiprocessors. The work is a part of a comprehensive, FORTRAN-based system to form a basis for a parallel version of the NICE/SPAR combination which will form the CSM Testbed. The software is macro-based and rests on the force methodology developed by the principal investigator in connection with an early scientific multiprocessor. Machine independence is an important characteristic of the system so that retargeting it to the Flex/32, or any other multiprocessor on which NICE/SPAR might be imnplemented, is well supported. The principal investigator has experience in producing parallel software for both full and sparse systems of linear equations using the force macros. Other researchers have used the Force in finite element programs. It has been possible to rapidly develop software which performs at maximum efficiency on a multiprocessor. The inherent machine independence of the system also means that the parallelization will not be limited to a specific multiprocessor.

  2. Concurrent Collections (CnC): A new approach to parallel programming

    ScienceCinema

    None

    2016-07-12

    A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer or program) who demands maximal flexibility for optimizing the mapping to a wide range of target platforms (parallel / serial, shared / distributed, homogeneous / heterogeneous, etc.) but needs no knowledge of the domain. Concurrent Collections (CnC) is based on this separation of concerns. The talk will present CnC and its benefits. About the speaker Kathleen Knobe has focused throughout her career on parallelism especially compiler technology, runtime system design and language design. She worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms for Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipment’s Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired and absorbed by Intel. She currently works in the Software and

  3. The neural basis of parallel saccade programming: an fMRI study.

    PubMed

    Hu, Yanbo; Walker, Robin

    2011-11-01

    The neural basis of parallel saccade programming was examined in an event-related fMRI study using a variation of the double-step saccade paradigm. Two double-step conditions were used: one enabled the second saccade to be partially programmed in parallel with the first saccade while in a second condition both saccades had to be prepared serially. The intersaccadic interval, observed in the parallel programming (PP) condition, was significantly reduced compared with latency in the serial programming (SP) condition and also to the latency of single saccades in control conditions. The fMRI analysis revealed greater activity (BOLD response) in the frontal and parietal eye fields for the PP condition compared with the SP double-step condition and when compared with the single-saccade control conditions. By contrast, activity in the supplementary eye fields was greater for the double-step condition than the single-step condition but did not distinguish between the PP and SP requirements. The role of the frontal eye fields in PP may be related to the advanced temporal preparation and increased salience of the second saccade goal that may mediate activity in other downstream structures, such as the superior colliculus. The parietal lobes may be involved in the preparation for spatial remapping, which is required in double-step conditions. The supplementary eye fields appear to have a more general role in planning saccade sequences that may be related to error monitoring and the control over the execution of the correct sequence of responses.

  4. Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts

    SciTech Connect

    1997-12-31

    This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.

  5. Facing competition: Neural mechanisms underlying parallel programming of antisaccades and prosaccades.

    PubMed

    Talanow, Tobias; Kasparbauer, Anna-Maria; Steffens, Maria; Meyhöfer, Inga; Weber, Bernd; Smyrnis, Nikolaos; Ettinger, Ulrich

    2016-08-01

    The antisaccade task is a prominent tool to investigate the response inhibition component of cognitive control. Recent theoretical accounts explain performance in terms of parallel programming of exogenous and endogenous saccades, linked to the horse race metaphor. Previous studies have tested the hypothesis of competing saccade signals at the behavioral level by selectively slowing the programming of endogenous or exogenous processes e.g. by manipulating the probability of antisaccades in an experimental block. To gain a better understanding of inhibitory control processes in parallel saccade programming, we analyzed task-related eye movements and blood oxygenation level dependent (BOLD) responses obtained using functional magnetic resonance imaging (fMRI) at 3T from 16 healthy participants in a mixed antisaccade and prosaccade task. The frequency of antisaccade trials was manipulated across blocks of high (75%) and low (25%) antisaccade frequency. In blocks with high antisaccade frequency, antisaccade latencies were shorter and error rates lower whilst prosaccade latencies were longer and error rates were higher. At the level of BOLD, activations in the task-related saccade network (left inferior parietal lobe, right inferior parietal sulcus, left precentral gyrus reaching into left middle frontal gyrus and inferior frontal junction) and deactivations in components of the default mode network (bilateral temporal cortex, ventromedial prefrontal cortex) compensated increased cognitive control demands. These findings illustrate context dependent mechanisms underlying the coordination of competing decision signals in volitional gaze control. PMID:27363008

  6. Facing competition: Neural mechanisms underlying parallel programming of antisaccades and prosaccades.

    PubMed

    Talanow, Tobias; Kasparbauer, Anna-Maria; Steffens, Maria; Meyhöfer, Inga; Weber, Bernd; Smyrnis, Nikolaos; Ettinger, Ulrich

    2016-08-01

    The antisaccade task is a prominent tool to investigate the response inhibition component of cognitive control. Recent theoretical accounts explain performance in terms of parallel programming of exogenous and endogenous saccades, linked to the horse race metaphor. Previous studies have tested the hypothesis of competing saccade signals at the behavioral level by selectively slowing the programming of endogenous or exogenous processes e.g. by manipulating the probability of antisaccades in an experimental block. To gain a better understanding of inhibitory control processes in parallel saccade programming, we analyzed task-related eye movements and blood oxygenation level dependent (BOLD) responses obtained using functional magnetic resonance imaging (fMRI) at 3T from 16 healthy participants in a mixed antisaccade and prosaccade task. The frequency of antisaccade trials was manipulated across blocks of high (75%) and low (25%) antisaccade frequency. In blocks with high antisaccade frequency, antisaccade latencies were shorter and error rates lower whilst prosaccade latencies were longer and error rates were higher. At the level of BOLD, activations in the task-related saccade network (left inferior parietal lobe, right inferior parietal sulcus, left precentral gyrus reaching into left middle frontal gyrus and inferior frontal junction) and deactivations in components of the default mode network (bilateral temporal cortex, ventromedial prefrontal cortex) compensated increased cognitive control demands. These findings illustrate context dependent mechanisms underlying the coordination of competing decision signals in volitional gaze control.

  7. Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.

    PubMed

    Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano

    2014-09-01

    A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system.

  8. Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.

    PubMed

    Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano

    2014-09-01

    A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system. PMID:26588521

  9. A comparison of distributed memory and virtual shared memory parallel programming models

    SciTech Connect

    Keane, J.A.; Grant, A.J.; Xu, M.Q.

    1993-04-01

    The virtues of the different parallel programming models, shared memory and distributed memory, have been much debated. Conventionally the debate could be reduced to programming convenience on the one hand, and high salability factors on the other. More recently the debate has become somewhat blurred with the provision of virtual shared memory models built on machines with physically distributed memory. The intention of such models/machines is to provide scalable shared memory, i.e. to provide both programmer convenience and high salability. In this paper, the different models are considered from experiences gained with a number of system ranging from applications in both commerce and science to languages and operating systems. Case studies are introduced as appropriate.

  10. Efficient iteration in data-parallel programs with irregular and dynamically distributed data structures

    SciTech Connect

    Littlefield, R.J.

    1990-02-01

    To implement an efficient data-parallel program on a non-shared memory MIMD multicomputer, data and computations must be properly partitioned to achieve good load balance and locality of reference. Programs with irregular data reference patterns often require irregular partitions. Although good partitions may be easy to determine, they can be difficult or impossible to implement in programming languages that provide only regular data distributions, such as blocked or cyclic arrays. We are developing Onyx, a programming system that provides a shared memory model of distributed data structures and extends the concept of data distribution to include irregular and dynamic distributions. This provides a powerful means to specify irregular partitions. Perhaps surprisingly, programs using it can also execute efficiently. In this paper, we describe and evaluate the Onyx implementation of a model problem that repeatedly executes an irregular but fixed data reference pattern. On an NCUBE hypercube, the speed of the Onyx implementation is comparable to that of carefully handwritten message-passing code.

  11. Abstract machine based execution model for computer architecture design and efficient implementation of logic programs in parallel

    SciTech Connect

    Hermenegildo, M.V.

    1986-01-01

    The term Logic Programming refers to a variety of computer languages and execution models based on the traditional concept of Symbolic Logic. The expressive power of these languages offers promise to be of great assistance in facing the programming challenges of present and future symbolic processing applications in artificial intelligence, knowledge-based systems, and many other areas of computing. This dissertation presents an efficient parallel execution model for logic programs. The model is described from the source language level down to an Abstract Machine level, suitable for direct implementation on existing parallel systems or for the design of special purpose parallel architectures. Few assumptions are made at the source language level and, therefore, the techniques developed and the general Abstract Machine design are applicable to a variety of logic (and also functional) languages. These techniques offer efficient solutions to several areas of parallel Logic Programming implementation previously considered problematic or a source of considerable overhead, such as the detection and handling of variable binding conflicts in AND-parallelism, the specification of control and management of the execution tree, the treatment of distributed backtracking, and goal scheduling and memory management issues, etc. A parallel Abstract Machine design is offered, specifying data areas, operation, and a suitable instruction set.

  12. DOE SBIR Phase-1 Report on Hybrid CPU-GPU Parallel Development of the Eulerian-Lagrangian Barracuda Multiphase Program

    SciTech Connect

    Dr. Dale M. Snider

    2011-02-28

    This report gives the result from the Phase-1 work on demonstrating greater than 10x speedup of the Barracuda computer program using parallel methods and GPU processors (General-Purpose Graphics Processing Unit or Graphics Processing Unit). Phase-1 demonstrated a 12x speedup on a typical Barracuda function using the GPU processor. The problem test case used about 5 million particles and 250,000 Eulerian grid cells. The relative speedup, compared to a single CPU, increases with increased number of particles giving greater than 12x speedup. Phase-1 work provided a path for reformatting data structure modifications to give good parallel performance while keeping a friendly environment for new physics development and code maintenance. The implementation of data structure changes will be in Phase-2. Phase-1 laid the ground work for the complete parallelization of Barracuda in Phase-2, with the caveat that implemented computer practices for parallel programming done in Phase-1 gives immediate speedup in the current Barracuda serial running code. The Phase-1 tasks were completed successfully laying the frame work for Phase-2. The detailed results of Phase-1 are within this document. In general, the speedup of one function would be expected to be higher than the speedup of the entire code because of I/O functions and communication between the algorithms. However, because one of the most difficult Barracuda algorithms was parallelized in Phase-1 and because advanced parallelization methods and proposed parallelization optimization techniques identified in Phase-1 will be used in Phase-2, an overall Barracuda code speedup (relative to a single CPU) is expected to be greater than 10x. This means that a job which takes 30 days to complete will be done in 3 days. Tasks completed in Phase-1 are: Task 1: Profile the entire Barracuda code and select which subroutines are to be parallelized (See Section Choosing a Function to Accelerate) Task 2: Select a GPU consultant company and

  13. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    DOE PAGES

    O'keefe, Matthew; Parr, Terence; Edgar, B. Kevin; Anderson, Steve; Woodward, Paul; Dietz, Hank

    1995-01-01

    Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how applications codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. Wemore » have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.« less

  14. Mobile and replicated alignment of arrays in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

    1993-01-01

    When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

  15. Event-Based Study of the Effect of Execution Environments on Parallel Program Performance

    NASA Technical Reports Server (NTRS)

    Sarukkai, Sekhar R.; Yan, Jerry C.; Craw, James (Technical Monitor)

    1995-01-01

    In this paper we seek to demonstrate the importance of studying the effect of changes in execution environment parameters, on parallel applications executed on state-of-the-art multiprocessors. A comprehensive methodology for event-based analysis of program behavior is introduced. This methodology is used to study the performance significance of various system parameters such as processor speed, message-buffer size, buffer copy speed, network bandwidth, communication latency, interrupt overheads and other system parameters. With the help cf a few CFD examples, we illustrate the use of our technique in determining suitable parameter values of the execution environment for three applications. We also demonstrate how this approach can be used to predict performance across architectures and illustrate the use of visual and profile-like feedback to expose the effect of system parameters changes on the performance of specific applications module.

  16. Parallel conjugate gradient: effects of ordering strategies, programming paradigms, and architectural platforms

    SciTech Connect

    Oliker, L.; Li, X.; Heber, G.; Biswas, R.

    2000-05-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations with a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multithreaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.

  17. Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Heber, Gerd; Biswas, Rupak

    2000-01-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.

  18. Parallel Alternate Curriculum--A Mainstreaming Implementation Program at the Secondary Level: Alternative Teaching Strategies Combined with Basic Skills.

    ERIC Educational Resources Information Center

    Smith, Gayle

    The Parallel Alternate Curriculum (PAC), a model providng regular content courses, in regular classes, to secondary students with learning problems, combines basic skill instruction with alternative teaching strategies. PAC, a mainstreaming implementation program, is designed to provide inservice training emphasizing the process of how students…

  19. 3-D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS)

    SciTech Connect

    Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.

    1997-12-31

    The aim of the work performed is to develop a 3D parallel program for numerical calculation of gas dynamics problem with heat conductivity on distributed memory computational systems (CS), satisfying the condition of numerical result independence from the number of processors involved. Two basically different approaches to the structure of massive parallel computations have been developed. The first approach uses the 3D data matrix decomposition reconstructed at temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shareable memory. The second approach is based on using a 3D data matrix decomposition not reconstructed during a temporal cycle. The program was developed on 8-processor CS MP-3 made in VNIIEF and was adapted to a massive parallel CS Meiko-2 in LLNL by joint efforts of VNIIEF and LLNL staffs. A large number of numerical experiments has been carried out with different number of processors up to 256 and the efficiency of parallelization has been evaluated in dependence on processor number and their parameters.

  20. A component analysis based on serial results analyzing performance of parallel iterative programs

    SciTech Connect

    Richman, S.C.

    1994-12-31

    This research is concerned with the parallel performance of iterative methods for solving large, sparse, nonsymmetric linear systems. Most of the iterative methods are first presented with their time costs and convergence rates examined intensively on sequential machines, and then adapted to parallel machines. The analysis of the parallel iterative performance is more complicated than that of serial performance, since the former can be affected by many new factors, such as data communication schemes, number of processors used, and Ordering and mapping techniques. Although the author is able to summarize results from data obtained after examining certain cases by experiments, two questions remain: (1) How to explain the results obtained? (2) How to extend the results from the certain cases to general cases? To answer these two questions quantitatively, the author introduces a tool called component analysis based on serial results. This component analysis is introduced because the iterative methods consist mainly of several basic functions such as linked triads, inner products, and triangular solves, which have different intrinsic parallelisms and are suitable for different parallel techniques. The parallel performance of each iterative method is first expressed as a weighted sum of the parallel performance of the basic functions that are the components of the method. Then, one separately examines the performance of basic functions and the weighting distributions of iterative methods, from which two independent sets of information are obtained when solving a given problem. In this component approach, all the weightings require only serial costs not parallel costs, and each iterative method for solving a given problem is represented by its unique weighting distribution. The information given by the basic functions is independent of iterative method, while that given by weightings is independent of parallel technique, parallel machine and number of processors.

  1. Lazy task creation: A technique for increasing the granularity of parallel programs. Technical report

    SciTech Connect

    Mohr, E.; Kranz, D.; Halstead, R.H.

    1991-06-01

    Many parallel algorithms are naturally expressed at a fine level of granularity, often finer than a MIMD parallel system can exploit efficiently. Most builders of parallel systems have looked to either the programmer or a parallelizing compiler to increase the granularity of such algorithms. In this paper, the authors explore a third approach to the granularity problem by analyzing two strategies for combining parallel tasks dynamically at run-time. They reject the simpler load-based inlining method, where tasks are combined based on dynamic load level, in favor of the safer and more robust lazy task creation method, where tasks are created only retroactively as processing resources become available. These strategies grew out of work on Mul-T 17, an efficient parallel implementation of Scheme, but could be used with other languages as well. They describe our Mul-T implementations of lazy task creation for two contrasting machines, and present performance statistics which show the method's effectiveness. Lazy task creation allows efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems.

  2. Lazy task creation; A technique for increasing the granularity of parallel programs

    SciTech Connect

    Mohr, E. ); Kranz, D.A. ); Halstead, R.H. Jr. )

    1991-07-01

    Many parallel algorithms are naturally expressed at a fine level of granularity, often finer than a MIMD parallel system can exploit efficiently. Most builders of parallel systems have looked to either the programmer or a parallelizing compiler to increase the granularity of such algorithms. In this paper the authors explore a third approach to the granularity problem by analyzing two strategies for combining parallel tasks dynamically at runtime. The authors reject the simpler load-based inlining method, where tasks are combined based on dynamic load level, in favor of the safer and more robust lazy task creation method, where tasks are created only retroactively as processing resources become available. These strategies grew out of work on Mul-T, an efficient parallel implementation of Scheme, but could be used with other languages as well. The authors describe our Mul-T implementations of lazy task creation for two contrasting machines, and present performance statistics which show the method's effectiveness. Lazy task creation allows efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems.

  3. PIPS-SBB: A Parallel Distributed-Memory Branch-and-Bound Algorithm for Stochastic Mixed-Integer Programs

    DOE PAGES

    Munguia, Lluis-Miquel; Oxberry, Geoffrey; Rajan, Deepak

    2016-05-01

    Stochastic mixed-integer programs (SMIPs) deal with optimization under uncertainty at many levels of the decision-making process. When solved as extensive formulation mixed- integer programs, problem instances can exceed available memory on a single workstation. In order to overcome this limitation, we present PIPS-SBB: a distributed-memory parallel stochastic MIP solver that takes advantage of parallelism at multiple levels of the optimization process. We also show promising results on the SIPLIB benchmark by combining methods known for accelerating Branch and Bound (B&B) methods with new ideas that leverage the structure of SMIPs. Finally, we expect the performance of PIPS-SBB to improve furthermore » as more functionality is added in the future.« less

  4. Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

    NASA Technical Reports Server (NTRS)

    Harper, Richard

    1989-01-01

    In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.

  5. Scalable parallel programming for high performance seismic simulation on petascale heterogeneous supercomputers

    NASA Astrophysics Data System (ADS)

    Zhou, Jun

    The 1994 Northridge earthquake in Los Angeles, California, killed 57 people, injured over 8,700 and caused an estimated $20 billion in damage. Petascale simulations are needed in California and elsewhere to provide society with a better understanding of the rupture and wave dynamics of the largest earthquakes at shaking frequencies required to engineer safe structures. As the heterogeneous supercomputing infrastructures are becoming more common, numerical developments in earthquake system research are particularly challenged by the dependence on the accelerator elements to enable "the Big One" simulations with higher frequency and finer resolution. Reducing time to solution and power consumption are two primary focus area today for the enabling technology of fault rupture dynamics and seismic wave propagation in realistic 3D models of the crust's heterogeneous structure. This dissertation presents scalable parallel programming techniques for high performance seismic simulation running on petascale heterogeneous supercomputers. A real world earthquake simulation code, AWP-ODC, one of the most advanced earthquake codes to date, was chosen as the base code in this research, and the testbed is based on Titan at Oak Ridge National Laboraratory, the world's largest hetergeneous supercomputer. The research work is primarily related to architecture study, computation performance tuning and software system scalability. An earthquake simulation workflow has also been developed to support the efficient production sets of simulations. The highlights of the technical development are an aggressive performance optimization focusing on data locality and a notable data communication model that hides the data communication latency. This development results in the optimal computation efficiency and throughput for the 13-point stencil code on heterogeneous systems, which can be extended to general high-order stencil codes. Started from scratch, the hybrid CPU/GPU version of AWP

  6. Solar cell welded interconnection development program. [parallel gap and ultrasonic metal-metal bonding

    NASA Technical Reports Server (NTRS)

    Katzeff, J. S.

    1974-01-01

    Parallel gap welding and ultrasonic bonding techniques were developed for joining selected interconnect materials (silver, aluminum, copper, silver plated molybdenum and Kovar) to silver-titanium and aluminum contact cells. All process variables have been evaluated leading to establishment of optimum solar cell, interconnect, electrodes and equipment criteria for obtainment of consistent high quality welds. Applicability of nondestructive testing of solar cell welds has been studied. A pre-weld monitoring system is being built and will be utilized in the numerically controlled parallel gap weld station.

  7. Parallel implementation of inverse adding-doubling and Monte Carlo multi-layered programs for high performance computing systems with shared and distributed memory

    NASA Astrophysics Data System (ADS)

    Chugunov, Svyatoslav; Li, Changying

    2015-09-01

    Parallel implementation of two numerical tools popular in optical studies of biological materials-Inverse Adding-Doubling (IAD) program and Monte Carlo Multi-Layered (MCML) program-was developed and tested in this study. The implementation was based on Message Passing Interface (MPI) and standard C-language. Parallel versions of IAD and MCML programs were compared to their sequential counterparts in validation and performance tests. Additionally, the portability of the programs was tested using a local high performance computing (HPC) cluster, Penguin-On-Demand HPC cluster, and Amazon EC2 cluster. Parallel IAD was tested with up to 150 parallel cores using 1223 input datasets. It demonstrated linear scalability and the speedup was proportional to the number of parallel cores (up to 150x). Parallel MCML was tested with up to 1001 parallel cores using problem sizes of 104-109 photon packets. It demonstrated classical performance curves featuring communication overhead and performance saturation point. Optimal performance curve was derived for parallel MCML as a function of problem size. Typical speedup achieved for parallel MCML (up to 326x) demonstrated linear increase with problem size. Precision of MCML results was estimated in a series of tests - problem size of 106 photon packets was found optimal for calculations of total optical response and 108 photon packets for spatially-resolved results. The presented parallel versions of MCML and IAD programs are portable on multiple computing platforms. The parallel programs could significantly speed up the simulation for scientists and be utilized to their full potential in computing systems that are readily available without additional costs.

  8. Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2015-04-01

    PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors

  9. Strength analysis of parallel robot components in PLM Siemens NX 8.5 program

    NASA Astrophysics Data System (ADS)

    Ociepka, P.; Herbus, K.

    2015-11-01

    This article presents a series of numerical analyses in order to identify the states of stress in elements, which arise during the operation of the mechanism. The object of the research was parallel robot, which is the basis for the prototype of a driving simulator. To conduct the dynamic analysis was used the Motion Simulation module and the RecurDyn solver. In this module were created the joints which occur in the mechanism of a parallel robot. Next dynamic analyzes were performed to determine the maximal forces that will applied to the analyzed elements. It was also analyzed the platform motion during the simulation a collision of a car with a wall. In the next step, basing on the results obtained in the dynamic analysis, were performed the strength analyzes in the Advanced Simulation module. For calculation the NX Nastran solver was used.

  10. Determining the number of factors to retain: a Windows-based FORTRAN-IMSL program for parallel analysis.

    PubMed

    Kaufman, J D; Dunlap, W P

    2000-08-01

    Parallel analysis (PA; Horn, 1965) is a technique for determining the number of factors to retain in exploratory factor analysis that has been shown to be superior to more widely known methods (Zwick & Velicer, 1986). Despite its merits, PA is not widely used in the psychological literature, probably because the method is unfamiliar and because modern, Windows-compatible software to perform PA is unavailable. We provide a FORTRAN-IMSL program for PA that runs on a PC under Windows; it is interactive and designed to suit the range of problems encountered in most psychological research. Furthermore, we provide sample output from the PA program in the form of tabled values that can be used to verify the program operation; or, they can be used either directly or with interpolation to meet specific needs of the researcher.

  11. A parallel reduced-space sequential-quadratic programming algorithm for frequency-domain small animal optical tomography

    NASA Astrophysics Data System (ADS)

    Gu, Xuejun; Kim, Hyun K.; Masciotti, James; Hielscher, Andreas H.

    2009-02-01

    Computational speed and available memory size on a single processor are two limiting factors when using the frequency-domain equation of radiative transport (FD-ERT) as a forward and inverse model to reconstruct three-dimensional (3D) tomographic images. In this work, we report on a parallel, multiprocessor reducedspace sequential quadratic programming (RSQP) approach to improve computational speed and reduce memory requirement. To evaluate and quantify the performance of the code, we performed simulation studies employing a 3D numerical mouse model. Furthermore, we tested the algorithm with experimental data obtained from tumor bearing mice.

  12. Columbia University flow instability experimental program: Volume 3. Single tube parallel flow tests

    SciTech Connect

    Dougherty, T.; Maciuca, C.; McAssey, E.V. Jr.; Reddy, D.G.; Yang, B.W.

    1990-06-01

    The coolant in the Savannah River Site (SRS) production nuclear reactor assemblies is circulated as a subcooled liquid under normal operating conditions. This coolant is evenly distributed throughout multiple annular flow channels with a uniform pressure profile across each coolant flow channel. During the postulated Loss of Coolant Accident (LOCA), which is initiated by a hypothetical guillotine pipe break, the coolant flow through the reactor assemblies is significantly reduced. The flow reduction and accompanying power reduction (after shutdown is initiated) occur in the first 1--2 seconds of the LOCA. This portion of the LOCA is referred to as the Flow Instability phase. A series of down flow experiments have been conducted on three different size single tubes. The objective of these experiments was to determine the effect of a parallel flow path on the occurrence of flow instability. In all cases, it has been shown that the point of flow instability (OFI) determined under controlled flow operation does not change when operating in a controlled pressure drop mode (parallel path operation).

  13. Temperature Control with Two Parallel Small Loop Heat Pipes for GLM Program

    NASA Technical Reports Server (NTRS)

    Khrustalev, Dmitry; Stouffer, Chuck; Ku, Jentung; Hamilton, Jon; Anderson, Mark

    2014-01-01

    The concept of temperature control of an electronic component using a single Loop Heat Pipe (LHP) is well established for Aerospace applications. Using two LHPs is often desirable for redundancy/reliability reasons or for increasing the overall heat source-sink thermal conductance. This effort elaborates on temperature controlling operation of a thermal system that includes two small ammonia LHPs thermally coupled together at the evaporator end as well as at the condenser end and operating "in parallel". A transient model of the LHP system was developed on the Thermal Desktop (TradeMark) platform to understand some fundamental details of such parallel operation of the two LHPs. Extensive thermal-vacuum testing was conducted with two thermally coupled LHPs operating simultaneously as well as with only one LHP operating at a time. This paper outlines the temperature control procedures for two LHPs operating simultaneously with widely varying sink temperatures. The test data obtained during the thermal-vacuum testing, with both LHPs running simultaneously in comparison with only one LHP operating at a time, are presented with detailed explanations.

  14. Neurite, a Finite Difference Large Scale Parallel Program for the Simulation of Electrical Signal Propagation in Neurites under Mechanical Loading

    PubMed Central

    García-Grajales, Julián A.; Rucabado, Gabriel; García-Dopico, Antonio; Peña, José-María; Jérusalem, Antoine

    2015-01-01

    With the growing body of research on traumatic brain injury and spinal cord injury, computational neuroscience has recently focused its modeling efforts on neuronal functional deficits following mechanical loading. However, in most of these efforts, cell damage is generally only characterized by purely mechanistic criteria, functions of quantities such as stress, strain or their corresponding rates. The modeling of functional deficits in neurites as a consequence of macroscopic mechanical insults has been rarely explored. In particular, a quantitative mechanically based model of electrophysiological impairment in neuronal cells, Neurite, has only very recently been proposed. In this paper, we present the implementation details of this model: a finite difference parallel program for simulating electrical signal propagation along neurites under mechanical loading. Following the application of a macroscopic strain at a given strain rate produced by a mechanical insult, Neurite is able to simulate the resulting neuronal electrical signal propagation, and thus the corresponding functional deficits. The simulation of the coupled mechanical and electrophysiological behaviors requires computational expensive calculations that increase in complexity as the network of the simulated cells grows. The solvers implemented in Neurite—explicit and implicit—were therefore parallelized using graphics processing units in order to reduce the burden of the simulation costs of large scale scenarios. Cable Theory and Hodgkin-Huxley models were implemented to account for the electrophysiological passive and active regions of a neurite, respectively, whereas a coupled mechanical model accounting for the neurite mechanical behavior within its surrounding medium was adopted as a link between electrophysiology and mechanics. This paper provides the details of the parallel implementation of Neurite, along with three different application examples: a long myelinated axon, a segmented

  15. Neurite, a finite difference large scale parallel program for the simulation of electrical signal propagation in neurites under mechanical loading.

    PubMed

    García-Grajales, Julián A; Rucabado, Gabriel; García-Dopico, Antonio; Peña, José-María; Jérusalem, Antoine

    2015-01-01

    With the growing body of research on traumatic brain injury and spinal cord injury, computational neuroscience has recently focused its modeling efforts on neuronal functional deficits following mechanical loading. However, in most of these efforts, cell damage is generally only characterized by purely mechanistic criteria, functions of quantities such as stress, strain or their corresponding rates. The modeling of functional deficits in neurites as a consequence of macroscopic mechanical insults has been rarely explored. In particular, a quantitative mechanically based model of electrophysiological impairment in neuronal cells, Neurite, has only very recently been proposed. In this paper, we present the implementation details of this model: a finite difference parallel program for simulating electrical signal propagation along neurites under mechanical loading. Following the application of a macroscopic strain at a given strain rate produced by a mechanical insult, Neurite is able to simulate the resulting neuronal electrical signal propagation, and thus the corresponding functional deficits. The simulation of the coupled mechanical and electrophysiological behaviors requires computational expensive calculations that increase in complexity as the network of the simulated cells grows. The solvers implemented in Neurite--explicit and implicit--were therefore parallelized using graphics processing units in order to reduce the burden of the simulation costs of large scale scenarios. Cable Theory and Hodgkin-Huxley models were implemented to account for the electrophysiological passive and active regions of a neurite, respectively, whereas a coupled mechanical model accounting for the neurite mechanical behavior within its surrounding medium was adopted as a link between electrophysiology and mechanics. This paper provides the details of the parallel implementation of Neurite, along with three different application examples: a long myelinated axon, a segmented

  16. Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Das, Sajal K.; Harvey, Daniel; Oliker, Leonid

    1999-01-01

    The ability to dynamically adapt an unstructured -rid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the view point of portability on various multiprocessor platforms We address this problem by developing PLUM, tin automatic anti architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with a goal to providing a global view of system loads across processors. Experiments on, an SP2 and an Origin2000 demonstrate the portability of our approach which achieves superb load balance at the cost of minimal extra overhead.

  17. Parallel programming of gradient-based iterative image reconstruction schemes for optical tomography.

    PubMed

    Hielscher, Andreas H; Bartel, Sebastian

    2004-02-01

    Optical tomography (OT) is a fast developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of optical properties inside the human body. A major challenge remains the time-consuming, computational-intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performances are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems.

  18. Generating local addresses and communication sets for data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance FORTRAN. We show that, for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution and a computation involving the regular section A(l:h:s), the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little run-time overhead and acceptable preprocessing time.

  19. Generating local addresses and communication sets for data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution, and a computation involving the regular section A, the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.

  20. Software for numerical simulations in the field of quantum technologies based on parallel programming

    NASA Astrophysics Data System (ADS)

    Adamyan, H. H.; Adamyan, N. H.; Gevorgyan, N. T.; Gevorgyan, T. V.; Kryuchkyan, G. Yu.

    2008-05-01

    We provide a software package for numerical simulations and modeling of complex quantum systems in the presence of dissipation and decoherence for a wider class of problems in the field of quantum technologies. This software is based on the method of quantum trajectories usually used for calculations of the density matrix. An important part of this Toolkit is the universal user interface, which is based on Tool Command Language (TCL) scripting language. It is elaborated in such a manner that the system description and system parameters should not be included in the source code. The core is implemented as a generic set of C++ classes, which can be efficiently reused for modeling of a wide range of photonic systems. The code has been written so as to facilitate optimization of the performance without breaking the object-orientedness of the design. We demonstrate that this software package is very useful for high performance parallel calculations on the Cluster.

  1. Acceleration of the Geostatistical Software Library (GSLIB) by code optimization and hybrid parallel programming

    NASA Astrophysics Data System (ADS)

    Peredo, Oscar; Ortiz, Julián M.; Herrero, José R.

    2015-12-01

    The Geostatistical Software Library (GSLIB) has been used in the geostatistical community for more than thirty years. It was designed as a bundle of sequential Fortran codes, and today it is still in use by many practitioners and researchers. Despite its widespread use, few attempts have been reported in order to bring this package to the multi-core era. Using all CPU resources, GSLIB algorithms can handle large datasets and grids, where tasks are compute- and memory-intensive applications. In this work, a methodology is presented to accelerate GSLIB applications using code optimization and hybrid parallel processing, specifically for compute-intensive applications. Minimal code modifications are added decreasing as much as possible the elapsed time of execution of the studied routines. If multi-core processing is available, the user can activate OpenMP directives to speed up the execution using all resources of the CPU. If multi-node processing is available, the execution is enhanced using MPI messages between the compute nodes.Four case studies are presented: experimental variogram calculation, kriging estimation, sequential gaussian and indicator simulation. For each application, three scenarios (small, large and extra large) are tested using a desktop environment with 4 CPU-cores and a multi-node server with 128 CPU-nodes. Elapsed times, speedup and efficiency results are shown.

  2. What Multilevel Parallel Programs do when you are not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved that support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation specific behavior of the operating system (OS) or architectural issues. Rewriting-a large scientific application in order to employ a new programming paradigms is usually a time consuming and error prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse grained parallelization and OpenMP [9] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP which was developed by Taft. The approach is similar to MPi/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As it is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.

  3. Non-collective Parallel I/O for Global Address Space Programming Models

    SciTech Connect

    Krishnamoorthy, Sriram; Piernas Canovas, Juan; Tipparaju, Vinod; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy

    2007-09-13

    Achieving high performance for out-of-core applications typically involves explicit management of the movement of data between the disk and the physical memory. We are developing a programming environment in which the different levels of the memory hierarchy are handled efficiently in a unified transparent framework. In this paper, we present our experiences with implementing efficient non-collective I/O (GPC-IO) as part of this framework. As a generalization of the Remote Procedure Call (RPC) that was used as a foundation for the Sun NFS system, we developed a global procedure call (GPC) to invoke procedures on a remote node to handle non-collective I/O. We consider alternative approaches that can be employed in implementing this functionality. The approaches are evaluated using a representative computation from quantum chemistry. The results demonstrate that GPC-IO achieves better absolute execution times, strong-scaling, and weakscaling than the alternatives considered.

  4. The Impact of Parallel Programming Models on the Performance for GeoFEM Benchmarks based on Finite-Element Applications in Geophysics

    NASA Astrophysics Data System (ADS)

    Nakajima, K.

    2006-12-01

    In order to achieve minimal parallelization overhead for symmetric multiprocessor (SMP) cluster architectures and PC clusters with multi-core processors, a multi-level hybrid programming model is often employed. The aim of this method is to combine coarse-grain (message passing) and fine-grain (thread, OpenMP) parallelism. Another often-used programming model is the single-level flat MPI model, in which separate single-threaded MPI processes are executed on each processing element (PE). Although a significant amount of research on this issue has been conducted in recent years, it remains unclear whether the performance gains of the hybrid approach compensate for the increased programming complexity. Some examples show that flat MPI is rather better, although the efficiency depends on hardware performance, features of applications, and problem size. The author has developed parallel iterative solvers with preconditioning in GeoFEM project (http://geofem.tokyo.rist.or.jp), which is a national research project for parallel FEM platform in solid earth simulation on the Earth Simulator (ES). In GeoFEM, an efficient parallel iterative method for FEM has been developed for SMP cluster architectures with vector processors such as the Earth Simulator. The method is based on a three-level hybrid parallel programming model, including message passing for inter-SMP node communication, loop directives by OpenMP for intra-SMP node parallelization and vectorization for each PE. In this presentation, parallel iterative linear solvers for unstructured grids in FEM applications, which were developed for ES, have been ported to various types of parallel computers. Performance of flat-MPI and hybrid parallel programming model has been compared for ES, Hitachi SR8000, Hitachi SR11000, IBM SP-3, IBM p5- model 595 IBM BlueGene/L Prototype, and PC clusters with AMD Opteron. Effect of coloring and method for storage of coefficient matrices have been also evaluated in GeoFEM benchmarks based

  5. Parallel Lisp simulator

    SciTech Connect

    Weening, J.S.

    1988-05-01

    CSIM is a simulator for parallel Lisp, based on a continuation passing interpreter. It models a shared-memory multiprocessor executing programs written in Common Lisp, extended with several primitives for creating and controlling processes. This paper describes the structure of the simulator, measures its performance, and gives an example of its use with a parallel Lisp program.

  6. Center for Programming Models for Scalable Parallel Computing - Towards Enhancing OpenMP for Manycore and Heterogeneous Nodes

    SciTech Connect

    Barbara Chapman

    2012-02-01

    OpenMP was not well recognized at the beginning of the project, around year 2003, because of its limited use in DoE production applications and the inmature hardware support for an efficient implementation. Yet in the recent years, it has been graduately adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked deligiently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standard organization to make sure OpenMP are evolved in the direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support.

  7. Parallel computers

    SciTech Connect

    Treveaven, P.

    1989-01-01

    This book presents an introduction to object-oriented, functional, and logic parallel computing on which the fifth generation of computer systems will be based. Coverage includes concepts for parallel computing languages, a parallel object-oriented system (DOOM) and its language (POOL), an object-oriented multilevel VLSI simulator using POOL, and implementation of lazy functional languages on parallel architectures.

  8. Eclipse Parallel Tools Platform

    2005-02-18

    Designing and developing parallel programs is an inherently complex task. Developers must choose from the many parallel architectures and programming paradigms that are available, and face a plethora of tools that are required to execute, debug, and analyze parallel programs i these environments. Few, if any, of these tools provide any degree of integration, or indeed any commonality in their user interfaces at all. This further complicates the parallel developer's task, hampering software engineering practices,more » and ultimately reducing productivity. One consequence of this complexity is that best practice in parallel application development has not advanced to the same degree as more traditional programming methodologies. The result is that there is currently no open-source, industry-strength platform that provides a highly integrated environment specifically designed for parallel application development. Eclipse is a universal tool-hosting platform that is designed to providing a robust, full-featured, commercial-quality, industry platform for the development of highly integrated tools. It provides a wide range of core services for tool integration that allow tool producers to concentrate on their tool technology rather than on platform specific issues. The Eclipse Integrated Development Environment is an open-source project that is supported by over 70 organizations, including IBM, Intel and HP. The Eclipse Parallel Tools Platform (PTP) plug-in extends the Eclipse framwork by providing support for a rich set of parallel programming languages and paradigms, and a core infrastructure for the integration of a wide variety of parallel tools. The first version of the PTP is a prototype that only provides minimal functionality for parallel tool integration of a wide variety of parallel tools. The first version of the PTP is a prototype that only provides minimal functionality for parallel tool integration, support for a small number of parallel architectures

  9. Managing Algorithmic Skeleton Nesting Requirements in Realistic Image Processing Applications: The Case of the SKiPPER-II Parallel Programming Environment's Operating Model

    NASA Astrophysics Data System (ADS)

    Coudarcher, Rémi; Duculty, Florent; Serot, Jocelyn; Jurie, Frédéric; Derutin, Jean-Pierre; Dhome, Michel

    2005-12-01

    SKiPPER is a SKeleton-based Parallel Programming EnviRonment being developed since 1996 and running at LASMEA Laboratory, the Blaise-Pascal University, France. The main goal of the project was to demonstrate the applicability of skeleton-based parallel programming techniques to the fast prototyping of reactive vision applications. This paper deals with the special features embedded in the latest version of the project: algorithmic skeleton nesting capabilities and a fully dynamic operating model. Throughout the case study of a complete and realistic image processing application, in which we have pointed out the requirement for skeleton nesting, we are presenting the operating model of this feature. The work described here is one of the few reported experiments showing the application of skeleton nesting facilities for the parallelisation of a realistic application, especially in the area of image processing. The image processing application we have chosen is a 3D face-tracking algorithm from appearance.

  10. Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming

    NASA Technical Reports Server (NTRS)

    Gentzsch, W.

    1982-01-01

    Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

  11. Parallel computation and computers for artificial intelligence

    SciTech Connect

    Kowalik, J.S. )

    1988-01-01

    This book discusses Parallel Processing in Artificial Intelligence; Parallel Computing using Multilisp; Execution of Common Lisp in a Parallel Environment; Qlisp; Restricted AND-Parallel Execution of Logic Programs; PARLOG: Parallel Programming in Logic; and Data-driven Processing of Semantic Nets. Attention is also given to: Application of the Butterfly Parallel Processor in Artificial Intelligence; On the Range of Applicability of an Artificial Intelligence Machine; Low-level Vision on Warp and the Apply Programming Mode; AHR: A Parallel Computer for Pure Lisp; FAIM-1: An Architecture for Symbolic Multi-processing; and Overview of Al Application Oriented Parallel Processing Research in Japan.

  12. Parallel rendering

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1995-01-01

    This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.

  13. Parallel computing works

    SciTech Connect

    Not Available

    1991-10-23

    An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.

  14. A systolic array parallelizing compiler

    SciTech Connect

    Tseng, P.S. )

    1990-01-01

    This book presents a completely new approach to the problem of systolic array parallelizing compiler. It describes the AL parallelizing compiler for the Warp systolic array, the first working systolic array parallelizing compiler which can generate efficient parallel code for complete LINPACK routines. This book begins by analyzing the architectural strength of the Warp systolic array. It proposes a model for mapping programs onto the machine and introduces the notion of data relations for optimizing the program mapping. Also presented are successful applications of the AL compiler in matrix computation and image processing. A complete listing of the source program and compiler-generated parallel code are given to clarify the overall picture of the compiler. The book concludes that systolic array parallelizing compiler can produce efficient parallel code, almost identical to what the user would have written by hand.

  15. mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP

    NASA Astrophysics Data System (ADS)

    Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

    2012-02-01

    We have revised a general purpose parallel molecular dynamics simulation program mm_par using the object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark result will be presented here. New version program summaryProgram title: mm_par2.0 Catalogue identifier: ADXP_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 2 390 858 No. of bytes in distributed program, including test data, etc.: 25 068 310 Distribution format: tar.gz Programming language: C++ Computer: Any system operated by Linux or Unix Operating system: Linux Classification: 7.7 External routines: We provide wrappers for FFTW [1], Intel MKL library [2] FFT routine, and Numerical recipes [3] FFT, random number generator, and eigenvalue solver routines, SPRNG [4] random number generator, Mersenne Twister [5] random number generator, space filling curve routine. Catalogue identifier of previous version: ADXP_v1_0 Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560 Does the new version supersede the previous version?: Yes Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic scales to mesoscopic scales. Solution method: Molecular dynamics simulation in NVE, NVT, and NPT ensemble, Langevin dynamics simulation, dissipative particle dynamics simulation. Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification. It is also known to be better for maintenance. Second, version 1.0 was based on atom decomposition and domain decomposition scheme [6] for parallelization. However, atom

  16. Massively parallel visualization: Parallel rendering

    SciTech Connect

    Hansen, C.D.; Krogh, M.; White, W.

    1995-12-01

    This paper presents rendering algorithms, developed for massively parallel processors (MPPs), for polygonal, spheres, and volumetric data. The polygon algorithm uses a data parallel approach whereas the sphere and volume renderer use a MIMD approach. Implementations for these algorithms are presented for the Thinking Machines Corporation CM-5 MPP.

  17. Efficient Parallel All-Electron Four-Component Dirac-Kohn-Sham Program Using a Distributed Matrix Approach II.

    PubMed

    Storchi, Loriano; Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Quiney, Harry M

    2013-12-10

    We propose a new complete memory-distributed algorithm, which significantly improves the parallel implementation of the all-electron four-component Dirac-Kohn-Sham (DKS) module of BERTHA (J. Chem. Theory Comput. 2010, 6, 384). We devised an original procedure for mapping the DKS matrix between an efficient integral-driven distribution, guided by the structure of specific G-spinor basis sets and by density fitting algorithms, and the two-dimensional block-cyclic distribution scheme required by the ScaLAPACK library employed for the linear algebra operations. This implementation, because of the efficiency in the memory distribution, represents a leap forward in the applicability of the DKS procedure to arbitrarily large molecular systems and its porting on last-generation massively parallel systems. The performance of the code is illustrated by some test calculations on several gold clusters of increasing size. The DKS self-consistent procedure has been explicitly converged for two representative clusters, namely Au20 and Au34, for which the density of electronic states is reported and discussed. The largest gold cluster uses more than 39k basis functions and DKS matrices of the order of 23 GB. PMID:26592273

  18. A program system for ab initio MO calculations on vector and parallel processing machines III. Integral reordering and four-index transformation

    NASA Astrophysics Data System (ADS)

    Wiest, Roland; Demuynck, Jean; Bénard, Marc; Rohmer, Marie-Madeleine; Ernenwein, René

    1991-01-01

    This series of three papers presents a program system for ab initio molecular orbital calculations on vector and parallel computers. Part III is devoted to the four-index transformation on a molecular orbital basis of size NMO of the file of two-electron integrals ( pq∥ rs) generated by a contracted Gaussian set of size NATO (number of atomic orbitals). A fast Yoshimine algorithm first sorts the ( pq∥ rs) integrals with respect to index pq only. This file of half-sorted integrals labelled by their rs-index can be processed without further modification to generate either the transformed integrals or the supermatrix elements. The large memory available on the CRAY-2 has made possible to implement the transformation algorithm proposed by Bender in 1972, which requires a core-storage allocation varying as (NATO) 3. Two versions of Bender's algorithm are included in the present program. The first version is an in-core version, where the complete file of accumulated contributions to transformed integrals is stored and updated in central memory. This version has been parallelized by distributing over a limited number of logical tasks the NATO steps corresponding to the scanning of the most external loop. The second version is an out-of-core version, in which twin fires are alternatively used as input and output for the accumulated contributions to transformed integrals. This version is not parallel. The choice of one or another version and (for version 1) the determination of the number of tasks depends upon the balance between the available and the requested amounts of storage. The storage management and the choice of the proper version are carried out automatically using dynamic storage allocation. Both versions are vectorized and take advantage of the molecular symmetry.

  19. Parallel Pascal - An extended Pascal for parallel computers

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.

    1984-01-01

    Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

  20. Parallel computation with the force

    NASA Technical Reports Server (NTRS)

    Jordan, H. F.

    1985-01-01

    A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiproceossor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.

  1. Final Report, Center for Programming Models for Scalable Parallel Computing: Co-Array Fortran, Grant Number DE-FC02-01ER25505

    SciTech Connect

    Robert W. Numrich

    2008-04-22

    The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a programto provide access to the co-array model through access to the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing. A separate collaborative project with LANL and PNNL showed how to extend the

  2. Computer program MCAP provides for steady state thermal and flow analysis of multiple parallel channels in heat generating solid

    NASA Technical Reports Server (NTRS)

    Pierce, B. L.

    1967-01-01

    Computer program /MCAP/ calculates the temperature distribution in a heat generating solid complicated by nonuniform power and flow distributions between multiple channels. It determines the channel diameters coefficients, the effects of tolerences, the pressure drop at a given flowrate, or the flowrate for a specific pressure drop.

  3. Playable Serious Games for Studying and Programming Computational STEM and Informatics Applications of Distributed and Parallel Computer Architectures

    ERIC Educational Resources Information Center

    Amenyo, John-Thones

    2012-01-01

    Carefully engineered playable games can serve as vehicles for students and practitioners to learn and explore the programming of advanced computer architectures to execute applications, such as high performance computing (HPC) and complex, inter-networked, distributed systems. The article presents families of playable games that are grounded in…

  4. Parallel pipelining

    SciTech Connect

    Joseph, D.D.; Bai, R.; Liao, T.Y.; Huang, A.; Hu, H.H.

    1995-09-01

    In this paper the authors introduce the idea of parallel pipelining for water lubricated transportation of oil (or other viscous material). A parallel system can have major advantages over a single pipe with respect to the cost of maintenance and continuous operation of the system, to the pressure gradients required to restart a stopped system and to the reduction and even elimination of the fouling of pipe walls in continuous operation. The authors show that the action of capillarity in small pipes is more favorable for restart than in large pipes. In a parallel pipeline system, they estimate the number of small pipes needed to deliver the same oil flux as in one larger pipe as N = (R/r){sup {alpha}}, where r and R are the radii of the small and large pipes, respectively, and {alpha} = 4 or 19/7 when the lubricating water flow is laminar or turbulent.

  5. High performance parallel architectures

    SciTech Connect

    Anderson, R.E. )

    1989-09-01

    In this paper the author describes current high performance parallel computer architectures. A taxonomy is presented to show computer architecture from the user programmer's point-of-view. The effects of the taxonomy upon the programming model are described. Some current architectures are described with respect to the taxonomy. Finally, some predictions about future systems are presented. 5 refs., 1 fig.

  6. Parallel Quality Assessment of Emergency Departments by European Foundation for Quality Management Model and Iranian National Program for Hospital Evaluation

    PubMed Central

    IMANI NASAB, Mohammad Hasan; MOHAGHEGH, Bahram; KHALESI, Nader; JAAFARIPOOYAN, Ebrahim

    2013-01-01

    Background European Foundation for Quality Management (EFQM) model is a widely used quality management system (QMS) worldwide, including Iran. Current study aims to verify the quality assessment results of Iranian National Program for Hospital Evaluation (INPHE) based on those of EFQM. Methods: This cross-sectional study was conducted in 2012 on a sample of emergency departments (EDs) affiliated with Tehran University of Medical Sciences (TUMS), Iran. The standard questionnaire of EFQM (V-2010) was used to gather appropriate data. The results were compared with those of INPHE. MS Excel was used to classify and display the findings. Results: The average assessment score of the EDs based on the INPHE and EFQM model were largely different (i.e. 86.4% and 31%, respectively). In addition, the variation range among five EDs’ scores according to each model was also considerable (22% for EFQM against 7% of INPHE), especially in the EDs with and without prior record of applying QMSs. Conclusion: The INPHE’s assessment results were not confirmed by EFQM model. Moreover, the higher variation range among EDs’ scores using EFQM model could allude to its more differentiation power in assessing the performance comparing with INPHE. Therefore, a need for improvement in the latter drawing on other QMSs’ (such as EFQM) strengths, given the results emanated from its comparison with EFQM seems indispensable. PMID:23967429

  7. A program system for ab initio MO calculations on vector and parallel processing machines II. SCF closed-shell and open-shell iterations

    NASA Astrophysics Data System (ADS)

    Rohmer, Marie-Madeleine; Demuynck, Jean; Bénard, Marc; Wiest, Roland; Bachmann, Christian; Henriet, Charles; Ernenwein, René

    1990-08-01

    This series of three papers presents a program system for ab initio molecular orbital calculations on vector and parallel computers. Part II is devoted to SCF iterations on closed-shell and open-shell configurations starting from a file of two-electron integrals on the basis of contracted Gaussians (CGTOs). In a preliminary step, the two-electron integrals ( pq‖ rs) are reordered according to increasing values of index pq = p( p-1)/2+ q. Then, in the first SCF iteration step, a file of semi-ordered P supermatrix elements (or P and Q supermatrix elements in the open-shell case) is generated from the file of semi-ordered integrals. This file is processed at each iteration step in an efficient vector loop to generate the electron repulsion matrix. Convergence is automatically controlled through level-shifting techniques. The most time-consuming parts are the integral sorting and the generation of the P supermatrix, which are carried out only once. Subsequent SCF iteration steps respectively require 0.9 s and 1.7 s for the process of 10 7 supermatrix elements by the closed-shell and the open-shell programs on a CRAY-2 processor.

  8. Internal combustion engine control for series hybrid electric vehicles by parallel and distributed genetic programming/multiobjective genetic algorithms

    NASA Astrophysics Data System (ADS)

    Gladwin, D.; Stewart, P.; Stewart, J.

    2011-02-01

    This article addresses the problem of maintaining a stable rectified DC output from the three-phase AC generator in a series-hybrid vehicle powertrain. The series-hybrid prime power source generally comprises an internal combustion (IC) engine driving a three-phase permanent magnet generator whose output is rectified to DC. A recent development has been to control the engine/generator combination by an electronically actuated throttle. This system can be represented as a nonlinear system with significant time delay. Previously, voltage control of the generator output has been achieved by model predictive methods such as the Smith Predictor. These methods rely on the incorporation of an accurate system model and time delay into the control algorithm, with a consequent increase in computational complexity in the real-time controller, and as a necessity relies to some extent on the accuracy of the models. Two complementary performance objectives exist for the control system. Firstly, to maintain the IC engine at its optimal operating point, and secondly, to supply a stable DC supply to the traction drive inverters. Achievement of these goals minimises the transient energy storage requirements at the DC link, with a consequent reduction in both weight and cost. These objectives imply constant velocity operation of the IC engine under external load disturbances and changes in both operating conditions and vehicle speed set-points. In order to achieve these objectives, and reduce the complexity of implementation, in this article a controller is designed by the use of Genetic Programming methods in the Simulink modelling environment, with the aim of obtaining a relatively simple controller for the time-delay system which does not rely on the implementation of real time system models or time delay approximations in the controller. A methodology is presented to utilise the miriad of existing control blocks in the Simulink libraries to automatically evolve optimal control

  9. File concepts for parallel I/O

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1989-01-01

    The subject of input/output (I/O) was often neglected in the design of parallel computer systems, although for many problems I/O rates will limit the speedup attainable. The I/O problem is addressed by considering the role of files in parallel systems. The notion of parallel files is introduced. Parallel files provide for concurrent access by multiple processes, and utilize parallelism in the I/O system to improve performance. Parallel files can also be used conventionally by sequential programs. A set of standard parallel file organizations is proposed, organizations are suggested, using multiple storage devices. Problem areas are also identified and discussed.

  10. Optimizing parallel reduction operations

    SciTech Connect

    Denton, S.M.

    1995-06-01

    A parallel program consists of sets of concurrent and sequential tasks. Often, a reduction (such as array sum) sequentially combines values produced by a parallel computation. Because reductions occur so frequently in otherwise parallel programs, they are good candidates for optimization. Since reductions may introduce dependencies, most languages separate computation and reduction. The Sisal functional language is unique in that reduction is a natural consequence of loop expressions; the parallelism is implicit in the language. Unfortunately, the original language supports only seven reduction operations. To generalize these expressions, the Sisal 90 definition adds user-defined reductions at the language level. Applicable optimizations depend upon the mathematical properties of the reduction. Compilation and execution speed, synchronization overhead, memory use and maximum size influence the final implementation. This paper (1) Defines reduction syntax and compares with traditional concurrent methods; (2) Defines classes of reduction operations; (3) Develops analysis of classes for optimized concurrency; (4) Incorporates reductions into Sisal 1.2 and Sisal 90; (5) Evaluates performance and size of the implementations.

  11. Parallelizing OVERFLOW: Experiences, Lessons, Results

    NASA Technical Reports Server (NTRS)

    Jespersen, Dennis C.

    1999-01-01

    The computer code OVERFLOW is widely used in the aerodynamic community for the numerical solution of the Navier-Stokes equations. Current trends in computer systems and architectures are toward multiple processors and parallelism, including distributed memory. This report describes work that has been carried out by the author and others at Ames Research Center with the goal of parallelizing OVERFLOW using a variety of parallel architectures and parallelization strategies. This paper begins with a brief description of the OVERFLOW code. This description includes the basic numerical algorithm and some software engineering considerations. Next comes a description of a parallel version of OVERFLOW, OVERFLOW/PVM, using PVM (Parallel Virtual Machine). This parallel version of OVERFLOW uses the manager/worker style and is part of the standard OVERFLOW distribution. Then comes a description of a parallel version of OVERFLOW, OVERFLOW/MPI, using MPI (Message Passing Interface). This parallel version of OVERFLOW uses the SPMD (Single Program Multiple Data) style. Finally comes a discussion of alternatives to explicit message-passing in the context of parallelizing OVERFLOW.

  12. The NAS Parallel Benchmarks

    SciTech Connect

    Bailey, David H.

    2009-11-15

    The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym 'NAS' originally stood for the Numerical Aeronautical Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains 'NAS.' The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, LeoDagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga. The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage

  13. Parallel Information Processing.

    ERIC Educational Resources Information Center

    Rasmussen, Edie M.

    1992-01-01

    Examines parallel computer architecture and the use of parallel processors for text. Topics discussed include parallel algorithms; performance evaluation; parallel information processing; parallel access methods for text; parallel and distributed information retrieval systems; parallel hardware for text; and network models for information…

  14. EDDY RESOLVING NUTRIENT ECODYNAMICS IN THE GLOBAL PARALLEL OCEAN PROGRAM AND CONNECTIONS WITH TRACE GASES IN THE SULFUR, HALOGEN AND NMHC CYCLES

    SciTech Connect

    S. CHU; S. ELLIOTT

    2000-08-01

    Ecodynamics and the sea-air transfer of climate relevant trace gases are intimately coupled in the oceanic mixed layer. Ventilation of species such as dimethyl sulfide and methyl bromide constitutes a key linkage within the earth system. We are creating a research tool for the study of marine trace gas distributions by implementing coupled ecology-gas chemistry in the high resolution Parallel Ocean Program (POP). The fundamental circulation model is eddy resolving, with cell sizes averaging 0.15 degree (lat/long). Here we describe ecochemistry integration. Density dependent mortality and iron geochemistry have enhanced agreement with chlorophyll measurements. Indications are that dimethyl sulfide production rates must be adjusted for latitude dependence to match recent compilations. This may reflect the need for phytoplankton to conserve nitrogen by favoring sulfurous osmolytes. Global simulations are also available for carbonyl sulfide, the methyl halides and for nonmethane hydrocarbons. We discuss future applications including interaction with atmospheric chemistry models, high resolution biogeochemical snapshots and the study of open ocean fertilization.

  15. Parallel incremental compilation. Doctoral thesis

    SciTech Connect

    Gafter, N.M.

    1990-06-01

    The time it takes to compile a large program has been a bottleneck in the software development process. When an interactive programming environment with an incremental compiler is used, compilation speed becomes even more important, but existing incremental compilers are very slow for some types of program changes. We describe a set of techniques that enable incremental compilation to exploit fine-grained concurrency in a shared-memory multi-processor and achieve asymptotic improvement over sequential algorithms. Because parallel non-incremental compilation is a special case of parallel incremental compilation, the design of a parallel compiler is a corollary of our result. Instead of running the individual phases concurrently, our design specifies compiler phases that are mutually sequential. However, each phase is designed to exploit fine-grained parallelism. By allowing each phase to present its output as a complete structure rather than as a stream of data, we can apply techniques such as parallel prefix and parallel divide-and-conquer, and we can construct applicative data structures to achieve sublinear execution time. Parallel algorithms for each phase of a compiler are presented to demonstrate that a complete incremental compiler can achieve execution time that is asymptotically less than sequential algorithms.

  16. EFFICIENT SCHEDULING OF PARALLEL JOBS ON MASSIVELY PARALLEL SYSTEMS

    SciTech Connect

    F. PETRINI; W. FENG

    1999-09-01

    We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of low-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads.

  17. Parallel machine architecture and compiler design facilities

    NASA Technical Reports Server (NTRS)

    Kuck, David J.; Yew, Pen-Chung; Padua, David; Sameh, Ahmed; Veidenbaum, Alex

    1990-01-01

    The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of Delta project (which objective is to provide a facility to allow rapid prototyping of parallelized compilers that can target toward different machine architectures) is summarized. Included are the surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role.

  18. Software For Diagnosis Of Parallel Processing

    NASA Technical Reports Server (NTRS)

    Hontalas, Philip; Yan, Jerry; Fineman, Charles

    1995-01-01

    Ames Instrumentation System (AIMS) computer program package of software tools measuring and analyzing performances of parallel-processing application programs. Helps programmer to debug and refine, and to monitor and visualize execution of, parallel-processing application software for Intel iPSC/860 (or equivalent) multicomputer. Performance data collected displayed graphically on computer workstations supporting X-Windows.

  19. Gang scheduling a parallel machine

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III.

    1991-03-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processors. User program and their gangs of processors are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quantums are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory. 2 refs., 1 fig.

  20. Special parallel processing workshop

    SciTech Connect

    1994-12-01

    This report contains viewgraphs from the Special Parallel Processing Workshop. These viewgraphs deal with topics such as parallel processing performance, message passing, queue structure, and other basic concept detailing with parallel processing.

  1. Parallel algorithms for matrix computations

    SciTech Connect

    Plemmons, R.J.

    1990-01-01

    The present conference on parallel algorithms for matrix computations encompasses both shared-memory systems and distributed-memory systems, as well as combinations of the two, to provide an overall perspective on parallel algorithms for both dense and sparse matrix computations in solving systems of linear equations, dense or structured problems related to least-squares computations, eigenvalue computations, singular-value computations, and rapid elliptic solvers. Specific issues addressed include the influence of parallel and vector architectures on algorithm design, computations for distributed-memory architectures such as hypercubes, solutions for sparse symmetric positive definite linear systems, symbolic and numeric factorizations, and triangular solutions. Also addressed are reference sources for parallel and vector numerical algorithms, sources for machine architectures, and sources for programming languages.

  2. Implementation and performance of parallelized elegant.

    SciTech Connect

    Wang, Y.; Borland, M.; Accelerator Systems Division

    2008-01-01

    The program elegant is widely used for design and modeling of linacs for free-electron lasers and energy recovery linacs, as well as storage rings and other applications. As part of a multi-year effort, we have parallelized many aspects of the code, including single-particle dynamics, wakefields, and coherent synchrotron radiation. We report on the approach used for gradual parallelization, which proved very beneficial in getting parallel features into the hands of users quickly. We also report details of parallelization of collective effects. Finally, we discuss performance of the parallelized code in various applications.

  3. Data-parallel algorithms for image computing

    NASA Astrophysics Data System (ADS)

    Carlotto, Mark J.

    1990-11-01

    Data-parallel algorithms for image computing on the Connection Machine are described. After a brief review of some basic programming concepts in *Lip, a parallel extension of Common Lisp, data-parallel programming paradigms based on a local (diffusion-like) model of computation, the scan model of computation, a general interprocessor communications model, and a region-based model are introduced. Algorithms for connected component labeling, distance transformation, Voronoi diagrams, finding minimum cost paths, local means, shape-from-shading, hidden surface calculations, affine transformation, oblique parallel projection, and spatial operations over regions are presented. An new algorithm for interpolating irregularly spaced data via Voronoi diagrams is also described.

  4. Parallel Eclipse Project Checkout

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas M.; Joswig, Joseph C.; Shams, Khawaja S.; Powell, Mark W.; Bachmann, Andrew G.

    2011-01-01

    Parallel Eclipse Project Checkout (PEPC) is a program written to leverage parallelism and to automate the checkout process of plug-ins created in Eclipse RCP (Rich Client Platform). Eclipse plug-ins can be aggregated in a feature project. This innovation digests a feature description (xml file) and automatically checks out all of the plug-ins listed in the feature. This resolves the issue of manually checking out each plug-in required to work on the project. To minimize the amount of time necessary to checkout the plug-ins, this program makes the plug-in checkouts parallel. After parsing the feature, a request to checkout for each plug-in in the feature has been inserted. These requests are handled by a thread pool with a configurable number of threads. By checking out the plug-ins in parallel, the checkout process is streamlined before getting started on the project. For instance, projects that took 30 minutes to checkout now take less than 5 minutes. The effect is especially clear on a Mac, which has a network monitor displaying the bandwidth use. When running the client from a developer s home, the checkout process now saturates the bandwidth in order to get all the plug-ins checked out as fast as possible. For comparison, a checkout process that ranged from 8-200 Kbps from a developer s home is now able to saturate a pipe of 1.3 Mbps, resulting in significantly faster checkouts. Eclipse IDE (integrated development environment) tries to build a project as soon as it is downloaded. As part of another optimization, this innovation programmatically tells Eclipse to stop building while checkouts are happening, which dramatically reduces lock contention and enables plug-ins to continue downloading until all of them finish. Furthermore, the software re-enables automatic building, and forces Eclipse to do a clean build once it finishes checking out all of the plug-ins. This software is fully generic and does not contain any NASA-specific code. It can be applied to any

  5. FWT2D: A massively parallel program for frequency-domain full-waveform tomography of wide-aperture seismic data—Part 1: Algorithm

    NASA Astrophysics Data System (ADS)

    Sourbier, Florent; Operto, Stéphane; Virieux, Jean; Amestoy, Patrick; L'Excellent, Jean-Yves

    2009-03-01

    This is the first paper in a two-part series that describes a massively parallel code that performs 2D frequency-domain full-waveform inversion of wide-aperture seismic data for imaging complex structures. Full-waveform inversion methods, namely quantitative seismic imaging methods based on the resolution of the full wave equation, are computationally expensive. Therefore, designing efficient algorithms which take advantage of parallel computing facilities is critical for the appraisal of these approaches when applied to representative case studies and for further improvements. Full-waveform modelling requires the resolution of a large sparse system of linear equations which is performed with the massively parallel direct solver MUMPS for efficient multiple-shot simulations. Efficiency of the multiple-shot solution phase (forward/backward substitutions) is improved by using the BLAS3 library. The inverse problem relies on a classic local optimization approach implemented with a gradient method. The direct solver returns the multiple-shot wavefield solutions distributed over the processors according to a domain decomposition driven by the distribution of the LU factors. The domain decomposition of the wavefield solutions is used to compute in parallel the gradient of the objective function and the diagonal Hessian, this latter providing a suitable scaling of the gradient. The algorithm allows one to test different strategies for multiscale frequency inversion ranging from successive mono-frequency inversion to simultaneous multifrequency inversion. These different inversion strategies will be illustrated in the following companion paper. The parallel efficiency and the scalability of the code will also be quantified.

  6. West Virginia US Department of Energy experimental program to stimulate competitive research. Section 2: Human resource development; Section 3: Carbon-based structural materials research cluster; Section 3: Data parallel algorithms for scientific computing

    SciTech Connect

    Not Available

    1994-02-02

    This report consists of three separate but related reports. They are (1) Human Resource Development, (2) Carbon-based Structural Materials Research Cluster, and (3) Data Parallel Algorithms for Scientific Computing. To meet the objectives of the Human Resource Development plan, the plan includes K--12 enrichment activities, undergraduate research opportunities for students at the state`s two Historically Black Colleges and Universities, graduate research through cluster assistantships and through a traineeship program targeted specifically to minorities, women and the disabled, and faculty development through participation in research clusters. One research cluster is the chemistry and physics of carbon-based materials. The objective of this cluster is to develop a self-sustaining group of researchers in carbon-based materials research within the institutions of higher education in the state of West Virginia. The projects will involve analysis of cokes, graphites and other carbons in order to understand the properties that provide desirable structural characteristics including resistance to oxidation, levels of anisotropy and structural characteristics of the carbons themselves. In the proposed cluster on parallel algorithms, research by four WVU faculty and three state liberal arts college faculty are: (1) modeling of self-organized critical systems by cellular automata; (2) multiprefix algorithms and fat-free embeddings; (3) offline and online partitioning of data computation; and (4) manipulating and rendering three dimensional objects. This cluster furthers the state Experimental Program to Stimulate Competitive Research plan by building on existing strengths at WVU in parallel algorithms.

  7. Parallel rendering techniques for massively parallel visualization

    SciTech Connect

    Hansen, C.; Krogh, M.; Painter, J.

    1995-07-01

    As the resolution of simulation models increases, scientific visualization algorithms which take advantage of the large memory. and parallelism of Massively Parallel Processors (MPPs) are becoming increasingly important. For large applications rendering on the MPP tends to be preferable to rendering on a graphics workstation due to the MPP`s abundant resources: memory, disk, and numerous processors. The challenge becomes developing algorithms that can exploit these resources while minimizing overhead, typically communication costs. This paper will describe recent efforts in parallel rendering for polygonal primitives as well as parallel volumetric techniques. This paper presents rendering algorithms, developed for massively parallel processors (MPPs), for polygonal, spheres, and volumetric data. The polygon algorithm uses a data parallel approach whereas the sphere and volume render use a MIMD approach. Implementations for these algorithms are presented for the Thinking Ma.chines Corporation CM-5 MPP.

  8. Massively Parallel QCD

    SciTech Connect

    Soltz, R; Vranas, P; Blumrich, M; Chen, D; Gara, A; Giampap, M; Heidelberger, P; Salapura, V; Sexton, J; Bhanot, G

    2007-04-11

    The theory of the strong nuclear force, Quantum Chromodynamics (QCD), can be numerically simulated from first principles on massively-parallel supercomputers using the method of Lattice Gauge Theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures that it suggests. We demonstrate these methods on the BlueGene massively-parallel supercomputer and argue that LQCD and the BlueGene architecture are a natural match. This can be traced to the simple fact that LQCD is a regular lattice discretization of space into lattice sites while the BlueGene supercomputer is a discretization of space into compute nodes, and that both are constrained by requirements of locality. This simple relation is both technologically important and theoretically intriguing. The main result of this paper is the speedup of LQCD using up to 131,072 CPUs on the largest BlueGene/L supercomputer. The speedup is perfect with sustained performance of about 20% of peak. This corresponds to a maximum of 70.5 sustained TFlop/s. At these speeds LQCD and BlueGene are poised to produce the next generation of strong interaction physics theoretical results.

  9. Toward Parallel Document Clustering

    SciTech Connect

    Mogill, Jace A.; Haglin, David J.

    2011-09-01

    A key challenge to automated clustering of documents in large text corpora is the high cost of comparing documents in a multimillion dimensional document space. The Anchors Hierarchy is a fast data structure and algorithm for localizing data based on a triangle inequality obeying distance metric, the algorithm strives to minimize the number of distance calculations needed to cluster the documents into “anchors” around reference documents called “pivots”. We extend the original algorithm to increase the amount of available parallelism and consider two implementations: a complex data structure which affords efficient searching, and a simple data structure which requires repeated sorting. The sorting implementation is integrated with a text corpora “Bag of Words” program and initial performance results of end-to-end a document processing workflow are reported.

  10. Unified Parallel Software

    SciTech Connect

    McKay, Mike

    2003-12-01

    UPS (Unified Paralled Software is a collection of software tools libraries, scripts, executables) that assist in parallel programming. This consists of: o libups.a C/Fortran callable routines for message passing (utilities written on top of MPI) and file IO (utilities written on top of HDF). o libuserd-HDF.so EnSight user-defined reader for visualizing data files written with UPS File IO. o ups_libuserd_query, ups_libuserd_prep.pl, ups_libuserd_script.pl Executables/scripts to get information from data files and to simplify the use of EnSight on those data files. o ups_io_rm/ups_io_cp Manipulate data files written with UPS File IO These tools are portable to a wide variety of Unix platforms.

  11. Unified Parallel Software

    2003-12-01

    UPS (Unified Paralled Software is a collection of software tools libraries, scripts, executables) that assist in parallel programming. This consists of: o libups.a C/Fortran callable routines for message passing (utilities written on top of MPI) and file IO (utilities written on top of HDF). o libuserd-HDF.so EnSight user-defined reader for visualizing data files written with UPS File IO. o ups_libuserd_query, ups_libuserd_prep.pl, ups_libuserd_script.pl Executables/scripts to get information from data files and to simplify the use ofmore » EnSight on those data files. o ups_io_rm/ups_io_cp Manipulate data files written with UPS File IO These tools are portable to a wide variety of Unix platforms.« less

  12. A join algorithm for combining AND parallel solutions in AND/OR parallel systems

    SciTech Connect

    Ramkumar, B. ); Kale, L.V. )

    1992-02-01

    When two or more literals in the body of a Prolog clause are solved in (AND) parallel, their solutions need to be joined to compute solutions for the clause. This is often a difficult problem in parallel Prolog systems that exploit OR and independent AND parallelism in Prolog programs. In several AND/OR parallel systems proposed recently, this problem is side-stepped at the cost of unexploited OR parallelism in the program, in part due to the complexity of the backtracking algorithm beneath AND parallel branches. In some cases, the data dependency graphs used by these systems cannot represent all the exploitable independent AND parallelism known at compile time. In this paper, we describe the compile time analysis for an optimized join algorithm for supporting independent AND parallelism in logic programs efficiently without leaving and OR parallelism unexploited. We then discuss how this analysis can be used to yield very efficient runtime behavior. We also discuss problems associated with a tree representation of the search space when arbitrarily complex data dependency graphs are permitted. We describe how these problems can be resolved by mapping the search space onto data dependency graphs themselves. The algorithm has been implemented in a compiler for parallel Prolog based on the reduce-OR process model. The algorithm is suitable for the implementation of AND/OR systems on both shared and nonshared memory machines. Performance on benchmark programs.

  13. The language parallel Pascal and other aspects of the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.; Bruner, J. D.

    1982-01-01

    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

  14. MPP parallel forth

    NASA Technical Reports Server (NTRS)

    Dorband, John E.

    1987-01-01

    Massively Parallel Processor (MPP) Parallel FORTH is a derivative of FORTH-83 and Unified Software Systems' Uni-FORTH. The extension of FORTH into the realm of parallel processing on the MPP is described. With few exceptions, Parallel FORTH was made to follow the description of Uni-FORTH as closely as possible. Likewise, the parallel FORTH extensions were designed as philosophically similar to serial FORTH as possible. The MPP hardware characteristics, as viewed by the FORTH programmer, is discussed. Then a description is presented of how parallel FORTH is implemented on the MPP.

  15. Parallel Adaptive Mesh Refinement Library

    NASA Technical Reports Server (NTRS)

    Mac-Neice, Peter; Olson, Kevin

    2005-01-01

    Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.

  16. An object-oriented approach to nested data parallelism

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.; Chatterjee, Siddhartha

    1994-01-01

    This paper describes an implementation technique for integrating nested data parallelism into an object-oriented language. Data-parallel programming employs sets of data called 'collections' and expresses parallelism as operations performed over the elements of a collection. When the elements of a collection are also collections, then there is the possibility for 'nested data parallelism.' Few current programming languages support nested data parallelism however. In an object-oriented framework, a collection is a single object. Its type defines the parallel operations that may be applied to it. Our goal is to design and build an object-oriented data-parallel programming environment supporting nested data parallelism. Our initial approach is built upon three fundamental additions to C++. We add new parallel base types by implementing them as classes, and add a new parallel collection type called a 'vector' that is implemented as a template. Only one new language feature is introduced: the 'foreach' construct, which is the basis for exploiting elementwise parallelism over collections. The strength of the method lies in the compilation strategy, which translates nested data-parallel C++ into ordinary C++. Extracting the potential parallelism in nested 'foreach' constructs is called 'flattening' nested parallelism. We show how to flatten 'foreach' constructs using a simple program transformation. Our prototype system produces vector code which has been successfully run on workstations, a CM-2, and a CM-5.

  17. Parallel flow diffusion battery

    DOEpatents

    Yeh, H.C.; Cheng, Y.S.

    1984-01-01

    A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.

  18. Parallel flow diffusion battery

    DOEpatents

    Yeh, Hsu-Chi; Cheng, Yung-Sung

    1984-08-07

    A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.

  19. Stage-by-Stage and Parallel Flow Path Compressor Modeling for a Variable Cycle Engine, NASA Advanced Air Vehicles Program - Commercial Supersonic Technology Project - AeroServoElasticity

    NASA Technical Reports Server (NTRS)

    Kopasakis, George; Connolly, Joseph W.; Cheng, Larry

    2015-01-01

    This paper covers the development of stage-by-stage and parallel flow path compressor modeling approaches for a Variable Cycle Engine. The stage-by-stage compressor modeling approach is an extension of a technique for lumped volume dynamics and performance characteristic modeling. It was developed to improve the accuracy of axial compressor dynamics over lumped volume dynamics modeling. The stage-by-stage compressor model presented here is formulated into a parallel flow path model that includes both axial and rotational dynamics. This is done to enable the study of compressor and propulsion system dynamic performance under flow distortion conditions. The approaches utilized here are generic and should be applicable for the modeling of any axial flow compressor design accurate time domain simulations. The objective of this work is as follows. Given the parameters describing the conditions of atmospheric disturbances, and utilizing the derived formulations, directly compute the transfer function poles and zeros describing these disturbances for acoustic velocity, temperature, pressure, and density. Time domain simulations of representative atmospheric turbulence can then be developed by utilizing these computed transfer functions together with the disturbance frequencies of interest.

  20. The AIS-5000 parallel processor

    SciTech Connect

    Schmitt, L.A.; Wilson, S.S.

    1988-05-01

    The AIS-5000 is a commercially available massively parallel processor which has been designed to operate in an industrial environment. It has fine-grained parallelism with up to 1024 processing elements arranged in a single-instruction multiple-data (SIMD) architecture. The processing elements are arranged in a one-dimensional chain that, for computer vision applications, can be as wide as the image itself. This architecture has superior cost/performance characteristics than two-dimensional mesh-connected systems. The design of the processing elements and their interconnections as well as the software used to program the system allow a wide variety of algorithms and applications to be implemented. In this paper, the overall architecture of the system is described. Various components of the system are discussed, including details of the processing elements, data I/O pathways and parallel memory organization. A virtual two-dimensional model for programming image-based algorithms for the system is presented. This model is supported by the AIS-5000 hardware and software and allows the system to be treated as a full-image-size, two-dimensional, mesh-connected parallel processor. Performance bench marks are given for certain simple and complex functions.

  1. Parallel distributed computing using Python

    NASA Astrophysics Data System (ADS)

    Dalcin, Lisandro D.; Paz, Rodrigo R.; Kler, Pablo A.; Cosimo, Alejandro

    2011-09-01

    This work presents two software components aimed to relieve the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python. MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm. PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering. MPI for Python and PETSc for Python are fully integrated to PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.

  2. A generic fine-grained parallel C

    NASA Technical Reports Server (NTRS)

    Hamet, L.; Dorband, John E.

    1988-01-01

    With the present availability of parallel processors of vastly different architectures, there is a need for a common language interface to multiple types of machines. The parallel C compiler, currently under development, is intended to be such a language. This language is based on the belief that an algorithm designed around fine-grained parallelism can be mapped relatively easily to different parallel architectures, since a large percentage of the parallelism has been identified. The compiler generates a FORTH-like machine-independent intermediate code. A machine-dependent translator will reside on each machine to generate the appropriate executable code, taking advantage of the particular architectures. The goal of this project is to allow a user to run the same program on such machines as the Massively Parallel Processor, the CRAY, the Connection Machine, and the CYBER 205 as well as serial machines such as VAXes, Macintoshes and Sun workstations.

  3. Parallel simulation today

    NASA Technical Reports Server (NTRS)

    Nicol, David; Fujimoto, Richard

    1992-01-01

    This paper surveys topics that presently define the state of the art in parallel simulation. Included in the tutorial are discussions on new protocols, mathematical performance analysis, time parallelism, hardware support for parallel simulation, load balancing algorithms, and dynamic memory management for optimistic synchronization.

  4. Verbal and Visual Parallelism

    ERIC Educational Resources Information Center

    Fahnestock, Jeanne

    2003-01-01

    This study investigates the practice of presenting multiple supporting examples in parallel form. The elements of parallelism and its use in argument were first illustrated by Aristotle. Although real texts may depart from the ideal form for presenting multiple examples, rhetorical theory offers a rationale for minimal, parallel presentation. The…

  5. Implementation and performance of parallel Prolog interpreter

    SciTech Connect

    Wei, S.; Kale, L.V.; Balkrishna, R. . Dept. of Computer Science)

    1988-01-01

    In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE--OR process model which exploits both AND and OR parallelism in logic programs. It is machine independent as it runs on top of the chare-kernel--a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark pargrams on parallel machines including shared memory systems: an Alliant FX/8, Sequent and a MultiMax, and a non-shared memory systems: Intel iPSC/32 hypercube, in addition to its performance on a multiprocessor simulation system.

  6. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2014-10-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.

  7. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2013-10-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Preliminary results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.

  8. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2015-11-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fix boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.

  9. Computer-Aided Parallelizer and Optimizer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

  10. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1991-01-01

    The main contribution of the effort in the last two years is the introduction of the MOPPS system. After doing extensive literature search, we introduced the system which is described next. MOPPS employs a new solution to the problem of managing programs which solve scientific and engineering applications on a distributed processing environment. Autonomous computers cooperate efficiently in solving large scientific problems with this solution. MOPPS has the advantage of not assuming the presence of any particular network topology or configuration, computer architecture, or operating system. It imposes little overhead on network and processor resources while efficiently managing programs concurrently. The core of MOPPS is an intelligent program manager that builds a knowledge base of the execution performance of the parallel programs it is managing under various conditions. The manager applies this knowledge to improve the performance of future runs. The program manager learns from experience.

  11. Merlin - Massively parallel heterogeneous computing

    NASA Technical Reports Server (NTRS)

    Wittie, Larry; Maples, Creve

    1989-01-01

    Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.

  12. Parallel Computing in SCALE

    SciTech Connect

    DeHart, Mark D; Williams, Mark L; Bowman, Stephen M

    2010-01-01

    The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement

  13. Parallel algorithm development

    SciTech Connect

    Adams, T.F.

    1996-06-01

    Rapid changes in parallel computing technology are causing significant changes in the strategies being used for parallel algorithm development. One approach is simply to write computer code in a standard language like FORTRAN 77 or with the expectation that the compiler will produce executable code that will run in parallel. The alternatives are: (1) to build explicit message passing directly into the source code; or (2) to write source code without explicit reference to message passing or parallelism, but use a general communications library to provide efficient parallel execution. Application of these strategies is illustrated with examples of codes currently under development.

  14. The Nexus task-parallel runtime system

    SciTech Connect

    Foster, I.; Tuecke, S.; Kesselman, C.

    1994-12-31

    A runtime system provides a parallel language compiler with an interface to the low-level facilities required to support interaction between concurrently executing program components. Nexus is a portable runtime system for task-parallel programming languages. Distinguishing features of Nexus include its support for multiple threads of control, dynamic processor acquisition, dynamic address space creation, a global memory model via interprocessor references, and asynchronous events. In addition, it supports heterogeneity at multiple levels, allowing a single computation to utilize different programming languages, executables, processors, and network protocols. Nexus is currently being used as a compiler target for two task-parallel languages: Fortran M and Compositional C++. In this paper, we present the Nexus design, outline techniques used to implement Nexus on parallel computers, show how it is used in compilers, and compare its performance with that of another runtime system.

  15. Parallel Computing Strategies for Irregular Algorithms

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

    2002-01-01

    Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.

  16. An Evaluation of Kernel Equating: Parallel Equating with Classical Methods in the SAT Subject Tests[TM] Program. Research Report. ETS RR-09-06

    ERIC Educational Resources Information Center

    Grant, Mary C.; Zhang, Lilly; Damiano, Michele

    2009-01-01

    This study investigated kernel equating methods by comparing these methods to operational equatings for two tests in the SAT Subject Tests[TM] program. GENASYS (ETS, 2007) was used for all equating methods and scaled score kernel equating results were compared to Tucker, Levine observed score, chained linear, and chained equipercentile equating…

  17. Partitioning problems in parallel, pipelined and distributed computing

    NASA Technical Reports Server (NTRS)

    Bokhari, S.

    1985-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.

  18. Parallel digital forensics infrastructure.

    SciTech Connect

    Liebrock, Lorie M.; Duggan, David Patrick

    2009-10-01

    This report documents the architecture and implementation of a Parallel Digital Forensics infrastructure. This infrastructure is necessary for supporting the design, implementation, and testing of new classes of parallel digital forensics tools. Digital Forensics has become extremely difficult with data sets of one terabyte and larger. The only way to overcome the processing time of these large sets is to identify and develop new parallel algorithms for performing the analysis. To support algorithm research, a flexible base infrastructure is required. A candidate architecture for this base infrastructure was designed, instantiated, and tested by this project, in collaboration with New Mexico Tech. Previous infrastructures were not designed and built specifically for the development and testing of parallel algorithms. With the size of forensics data sets only expected to increase significantly, this type of infrastructure support is necessary for continued research in parallel digital forensics. This report documents the implementation of the parallel digital forensics (PDF) infrastructure architecture and implementation.

  19. Parallel, Distributed Scripting with Python

    SciTech Connect

    Miller, P J

    2002-05-24

    Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI library gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc ... These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider the a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.

  20. CTRANS: A Monte Carlo program for radiative transfer in plane parallel atmospheres with imbedded finite clouds: Development, testing and user's guide

    NASA Technical Reports Server (NTRS)

    1976-01-01

    The program called CTRANS is described which was designed to perform radiative transfer computations in an atmosphere with horizontal inhomogeneities (clouds). Since the atmosphere-ground system was to be richly detailed, the Monte Carlo method was employed. This means that results are obtained through direct modeling of the physical process of radiative transport. The effects of atmopheric or ground albedo pattern detail are essentially built up from their impact upon the transport of individual photons. The CTRANS program actually tracks the photons backwards through the atmosphere, initiating them at a receiver and following them backwards along their path to the Sun. The pattern of incident photons generated through backwards tracking automatically reflects the importance to the receiver of each region of the sky. Further, through backwards tracking, the impact of the finite field of view of the receiver and variations in its response over the field of view can be directly simulated.

  1. PCLIPS: Parallel CLIPS

    NASA Technical Reports Server (NTRS)

    Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan

    1994-01-01

    A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to/from remote nodes working memory. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also be used to run several benchmark parallel knowledge bases such as one to set up a cafeteria. Results show from running Parallel CLIPS with parallel knowledge base partitions indicate that significant speed increases, including superlinear in some cases, are possible.

  2. Parallel MR Imaging

    PubMed Central

    Deshmane, Anagha; Gulani, Vikas; Griswold, Mark A.; Seiberlich, Nicole

    2015-01-01

    Parallel imaging is a robust method for accelerating the acquisition of magnetic resonance imaging (MRI) data, and has made possible many new applications of MR imaging. Parallel imaging works by acquiring a reduced amount of k-space data with an array of receiver coils. These undersampled data can be acquired more quickly, but the undersampling leads to aliased images. One of several parallel imaging algorithms can then be used to reconstruct artifact-free images from either the aliased images (SENSE-type reconstruction) or from the under-sampled data (GRAPPA-type reconstruction). The advantages of parallel imaging in a clinical setting include faster image acquisition, which can be used, for instance, to shorten breath-hold times resulting in fewer motion-corrupted examinations. In this article the basic concepts behind parallel imaging are introduced. The relationship between undersampling and aliasing is discussed and two commonly used parallel imaging methods, SENSE and GRAPPA, are explained in detail. Examples of artifacts arising from parallel imaging are shown and ways to detect and mitigate these artifacts are described. Finally, several current applications of parallel imaging are presented and recent advancements and promising research in parallel imaging are briefly reviewed. PMID:22696125

  3. Interfacing Computer Aided Parallelization and Performance Analysis

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)

    2003-01-01

    When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.

  4. Parallel Performance of a Combustion Chemistry Simulation

    DOE PAGES

    Skinner, Gregg; Eigenmann, Rudolf

    1995-01-01

    We used a description of a combustion simulation's mathematical and computational methods to develop a version for parallel execution. The result was a reasonable performance improvement on small numbers of processors. We applied several important programming techniques, which we describe, in optimizing the application. This work has implications for programming languages, compiler design, and software engineering.

  5. Advanced parallel processing with supercomputer architectures

    SciTech Connect

    Hwang, K.

    1987-10-01

    This paper investigates advanced parallel processing techniques and innovative hardware/software architectures that can be applied to boost the performance of supercomputers. Critical issues on architectural choices, parallel languages, compiling techniques, resource management, concurrency control, programming environment, parallel algorithms, and performance enhancement methods are examined and the best answers are presented. The authors cover advanced processing techniques suitable for supercomputers, high-end mainframes, minisupers, and array processors. The coverage emphasizes vectorization, multitasking, multiprocessing, and distributed computing. In order to achieve these operation modes, parallel languages, smart compilers, synchronization mechanisms, load balancing methods, mapping parallel algorithms, operating system functions, application library, and multidiscipline interactions are investigated to ensure high performance. At the end, they assess the potentials of optical and neural technologies for developing future supercomputers.

  6. Two Level Parallel Grammatical Evolution

    NASA Astrophysics Data System (ADS)

    Ošmera, Pavel

    This paper describes a Two Level Parallel Grammatical Evolution (TLPGE) that can evolve complete programs using a variable length linear genome to govern the mapping of a Backus Naur Form grammar definition. To increase the efficiency of Grammatical Evolution (GE) the influence of backward processing was tested and a second level with differential evolution was added. The significance of backward coding (BC) and the comparison with standard coding of GEs is presented. The new method is based on parallel grammatical evolution (PGE) with a backward processing algorithm, which is further extended with a differential evolution algorithm. Thus a two-level optimization method was formed in attempt to take advantage of the benefits of both original methods and avoid their difficulties. Both methods used are discussed and the architecture of their combination is described. Also application is discussed and results on a real-word application are described.

  7. Parallelizing the XSTAR Photoionization Code

    NASA Astrophysics Data System (ADS)

    Noble, M. S.; Ji, L.; Young, A.; Lee, J. C.

    2009-09-01

    We describe two means by which XSTAR, a code which computes physical conditions and emission spectra of photoionized gases, has been parallelized. The first is pvmxstar, a wrapper which can be used in place of the serial xstar2xspec script to foster concurrent execution of the XSTAR command line application on independent sets of parameters. The second is pmodel, a plugin for the Interactive Spectral Interpretation System (ISIS) which allows arbitrary components of a broad range of astrophysical models to be distributed across processors during fitting and confidence limits calculations, by scientists with little training in parallel programming. Plugging the XSTAR family of analytic models into pmodel enables multiple ionization states (e.g., of a complex absorber/emitter) to be computed simultaneously, alleviating the often prohibitive expense of the traditional serial approach. Initial performance results indicate that these methods substantially enlarge the problem space to which XSTAR may be applied within practical timeframes.

  8. Extensions of ADA for SIMD parallel processing

    SciTech Connect

    Cline, C.; Siegel, H.J.

    1983-01-01

    In order to program SIMD (single instruction stream-multiple data stream) parallel machines used for tasks such as speech and image processing, a language with explicit parallel constructs is often desirable. The language ADA, developed by the Department of Defense, is used as a basis for such a language. Extensions of ADA which allow the user to specify such things as interprocessor communications and activation of processors are proposed. 25 references.

  9. Code Parallelization with CAPO: A User Manual

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)

    2001-01-01

    A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software - in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This will demonstrate the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.

  10. Totally parallel multilevel algorithms

    NASA Technical Reports Server (NTRS)

    Frederickson, Paul O.

    1988-01-01

    Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which are referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.

  11. Massively parallel mathematical sieves

    SciTech Connect

    Montry, G.R.

    1989-01-01

    The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.

  12. Structural mechanics computations on parallel computing platforms

    SciTech Connect

    Kulak, R.F.; Plaskacz, E.J.; Pfeiffer, P.A.

    1995-06-01

    With recent advances in parallel supercomputers and network-connected workstations, the solution to large scale structural engineering problems has now become tractable. High-performance computer architectures, which are usually available at large universities and national laboratories, now can solve large nonlinear problems. At the other end of the spectrum, network connected workstations can be configured to become a distributed-parallel computer. This approach is attractive to small, medium and large engineering firms. This paper describes the development of a parallelized finite element computer program for the solution of static, nonlinear structural mechanics problems.

  13. Automatic generation of synchronization instructions for parallel processors

    SciTech Connect

    Midkiff, S.P.

    1986-05-01

    The development of high speed parallel multi-processors, capable of parallel execution of doacross and forall loops, has stimulated the development of compilers to transform serial FORTRAN programs to parallel forms. One of the duties of such a compiler must be to place synchronization instructions in the parallel version of the program to insure the legal execution order of doacross and forall loops. This thesis gives strategies usable by a compiler to generate these synchronization instructions. It presents algorithms for reducing the parallelism in FORTRAN programs to match a target architecture, recovering some of the parallelism so discarded, and reducing the number of synchronization instructions that must be added to a FORTRAN program, as well as basic strategies for placing synchronization instructions. These algorithms are developed for two synchronization instruction sets. 20 refs., 56 figs.

  14. Bayer image parallel decoding based on GPU

    NASA Astrophysics Data System (ADS)

    Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

    2012-11-01

    In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.

  15. FORTRAN Extensions for Modular Parallel Processing

    1996-01-12

    FORTRAN M is a small set of extensions to FORTRAN that supports a modular approach to the construction of sequential and parallel programs. FORTRAN M programs use channels to plug together processes which may be written in FORTRAN M or FORTRAN 77. Processes communicate by sending and receiving messages on channels. Channels and processes can be created dynamically, but programs remain deterministic unless specialized nondeterministic constructs are used.

  16. Parallel compilation: A design and its application to SIMULA 67

    NASA Technical Reports Server (NTRS)

    Schwartz, R. L.

    1977-01-01

    A design is presented for a parallel compilation facility for the SIMULA 67 programming language. The proposed facility allows top-down, bottom-up, or parallel development and integration of program modules. An evaluation of the proposal and a discussion of its applicability to other languages are then given.

  17. The NAS parallel benchmarks

    NASA Technical Reports Server (NTRS)

    Bailey, David (Editor); Barton, John (Editor); Lasinski, Thomas (Editor); Simon, Horst (Editor)

    1993-01-01

    A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels,' and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

  18. PISCES: An environment for parallel scientific computation

    NASA Technical Reports Server (NTRS)

    Pratt, T. W.

    1985-01-01

    The parallel implementation of scientific computing environment (PISCES) is a project to provide high-level programming environments for parallel MIMD computers. Pisces 1, the first of these environments, is a FORTRAN 77 based environment which runs under the UNIX operating system. The Pisces 1 user programs in Pisces FORTRAN, an extension of FORTRAN 77 for parallel processing. The major emphasis in the Pisces 1 design is in providing a carefully specified virtual machine that defines the run-time environment within which Pisces FORTRAN programs are executed. Each implementation then provides the same virtual machine, regardless of differences in the underlying architecture. The design is intended to be portable to a variety of architectures. Currently Pisces 1 is implemented on a network of Apollo workstations and on a DEC VAX uniprocessor via simulation of the task level parallelism. An implementation for the Flexible Computing Corp. FLEX/32 is under construction. An introduction to the Pisces 1 virtual computer and the FORTRAN 77 extensions is presented. An example of an algorithm for the iterative solution of a system of equations is given. The most notable features of the design are the provision for several granularities of parallelism in programs and the provision of a window mechanism for distributed access to large arrays of data.

  19. The PARTY parallel runtime system

    NASA Technical Reports Server (NTRS)

    Saltz, J. H.; Mirchandaney, Ravi; Smith, R. M.; Crowley, Kay; Nicol, D. M.

    1989-01-01

    In the present automated system for the organization of the data and computational operations entailed by parallel problems, in ways that optimize multiprocessor performance, general heuristics for partitioning program data and control are implemented by capturing and manipulating representations of a computation at run time. These heuristics are directed toward the dynamic identification and allocation of concurrent work in computations with irregular computational patterns. An optimized static-workload partitioning is computed for such repetitive-computation pattern problems as the iterative ones employed in scientific computation.

  20. Parallel processing and medium-scale multiprocessors

    SciTech Connect

    Wouk, A.

    1989-01-01

    For some time, the community interested in large-scale scientific computing has been attempting to come to terms with parallel computation using a number of processors sufficient to make their concurrent utilization interesting, challenging, and, in the long run, beneficial. Unexpected consequences of parallelization have been discovered. It is possible to obtain reduced performance, both relative and absolute, from an increased number of processors, as a result of inappropriate use of resources in a multiprocessor environment. This exemplifies one of the paradoxes which result from our cultural bias towards sequential thought processes. As a consequence there is a bias for sequential styles of program development in a multiprocessor environment. The authors have learned that the problem of automatic optimization in compilation of parallel programs is computationally hard. Early hopes that automatic, optimal parallelization of sequentially conceived programs would be as achievable as earlier automatic vectorization had been, have been dashed. The authors lack the insights and folklore which are needed to develop useful methodologies and heuristics in the area of parallel computation. The authors are embarked on a voyage of exploration of this new territory, and the work described in this volume can provide helpful guidance. The authors have to explore fully the differences between distributed memory systems, shared memory systems, and combinations, as well as the relative applicability of SIMD and MIMD architectures. Based on the information obtained in such exploration, useful steps towards efficient utilization of many processors should become possible. This paper covers several areas: systems programming, parallel/language/programming systems, and applications programming.

  1. Highly parallel sparse Cholesky factorization

    NASA Technical Reports Server (NTRS)

    Gilbert, John R.; Schreiber, Robert

    1990-01-01

    Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.

  2. The Parallel Axiom

    ERIC Educational Resources Information Center

    Rogers, Pat

    1972-01-01

    Criteria for a reasonable axiomatic system are discussed. A discussion of the historical attempts to prove the independence of Euclids parallel postulate introduces non-Euclidean geometries. Poincare's model for a non-Euclidean geometry is defined and analyzed. (LS)

  3. Scalable parallel communications

    NASA Technical Reports Server (NTRS)

    Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

    1992-01-01

    Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth

  4. Gang scheduling a parallel machine. Revision 1

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III

    1991-12-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processes. User programs and their gangs of processes are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quantum are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory.

  5. Object-oriented parallel polygon rendering

    SciTech Connect

    Heiland, R.W.

    1994-09-01

    Since many scientific datasets can be visualized using some polygonal representation, a polygon renderer has broad use for scientific visualization. With today`s high performance computing applications producing very large datasets, a parallel polygon renderer is a necessary tool for keeping the compute-visualize cycle at a minimum. This paper presents a DOIV on renderer that combines the shared-memory and message-passing models of parallel programming. It uses the Global Arrays library, a shared-memory programming toolkit for distributed memory machines. The experience of using an object oriented approach for software design and development is also discussed.

  6. Parallel image compression

    NASA Technical Reports Server (NTRS)

    Reif, John H.

    1987-01-01

    A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless test compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.

  7. Artificial intelligence in parallel

    SciTech Connect

    Waldrop, M.M.

    1984-08-10

    The current rage in the Artificial Intelligence (AI) community is parallelism: the idea is to build machines with many independent processors doing many things at once. The upshot is that about a dozen parallel machines are now under development for AI alone. As might be expected, the approaches are diverse yet there are a number of fundamental issues in common: granularity, topology, control, and algorithms.

  8. Programs.

    ERIC Educational Resources Information Center

    Community College Journal, 1996

    1996-01-01

    Includes a collection of eight short articles describing model community college programs. Discusses a literacy program, a mobile computer classroom, a support program for at-risk students, a timber-harvesting program, a multimedia presentation on successful women graduates, a career center, a collaboration with NASA, and an Israeli engineering…

  9. Continuous parallel coordinates.

    PubMed

    Heinrich, Julian; Weiskopf, Daniel

    2009-01-01

    Typical scientific data is represented on a grid with appropriate interpolation or approximation schemes,defined on a continuous domain. The visualization of such data in parallel coordinates may reveal patterns latently contained in the data and thus can improve the understanding of multidimensional relations. In this paper, we adopt the concept of continuous scatterplots for the visualization of spatially continuous input data to derive a density model for parallel coordinates. Based on the point-line duality between scatterplots and parallel coordinates, we propose a mathematical model that maps density from a continuous scatterplot to parallel coordinates and present different algorithms for both numerical and analytical computation of the resulting density field. In addition, we show how the 2-D model can be used to successively construct continuous parallel coordinates with an arbitrary number of dimensions. Since continuous parallel coordinates interpolate data values within grid cells, a scalable and dense visualization is achieved, which will be demonstrated for typical multi-variate scientific data.

  10. Parallel strategies for SAR processing

    NASA Astrophysics Data System (ADS)

    Segoviano, Jesus A.

    2004-12-01

    This article proposes a series of strategies for improving the computer process of the Synthetic Aperture Radar (SAR) signal treatment, following the three usual lines of action to speed up the execution of any computer program. On the one hand, it is studied the optimization of both, the data structures and the application architecture used on it. On the other hand it is considered a hardware improvement. For the former, they are studied both, the usually employed SAR process data structures, proposing the use of parallel ones and the way the parallelization of the algorithms employed on the process is implemented. Besides, the parallel application architecture classifies processes between fine/coarse grain. These are assigned to individual processors or separated in a division among processors, all of them in their corresponding architectures. For the latter, it is studied the hardware employed on the computer parallel process used in the SAR handling. The improvement here refers to several kinds of platforms in which the SAR process is implemented, shared memory multicomputers, and distributed memory multiprocessors. A comparison between them gives us some guidelines to follow in order to get a maximum throughput with a minimum latency and a maximum effectiveness with a minimum cost, all together with a limited complexness. It is concluded and described, that the approach consisting of the processing of the algorithms in a GNU/Linux environment, together with a Beowulf cluster platform offers, under certain conditions, the best compromise between performance and cost, and promises the major development in the future for the Synthetic Aperture Radar computer power thirsty applications in the next years.

  11. A parallel PCG solver for MODFLOW.

    PubMed

    Dong, Yanhui; Li, Guomin

    2009-01-01

    In order to simulate large-scale ground water flow problems more efficiently with MODFLOW, the OpenMP programming paradigm was used to parallelize the preconditioned conjugate-gradient (PCG) solver with in this study. Incremental parallelization, the significant advantage supported by OpenMP on a shared-memory computer, made the solver transit to a parallel program smoothly one block of code at a time. The parallel PCG solver, suitable for both MODFLOW-2000 and MODFLOW-2005, is verified using an 8-processor computer. Both the impact of compilers and different model domain sizes were considered in the numerical experiments. Based on the timing results, execution times using the parallel PCG solver are typically about 1.40 to 5.31 times faster than those using the serial one. In addition, the simulation results are the exact same as the original PCG solver, because the majority of serial codes were not changed. It is worth noting that this parallelizing approach reduces cost in terms of software maintenance because only a single source PCG solver code needs to be maintained in the MODFLOW source tree. PMID:19563427

  12. Visualization and Tracking of Parallel CFD Simulations

    NASA Technical Reports Server (NTRS)

    Vaziri, Arsi; Kremenetsky, Mark

    1995-01-01

    We describe a system for interactive visualization and tracking of a 3-D unsteady computational fluid dynamics (CFD) simulation on a parallel computer. CM/AVS, a distributed, parallel implementation of a visualization environment (AVS) runs on the CM-5 parallel supercomputer. A CFD solver is run as a CM/AVS module on the CM-5. Data communication between the solver, other parallel visualization modules, and a graphics workstation, which is running AVS, are handled by CM/AVS. Partitioning of the visualization task, between CM-5 and the workstation, can be done interactively in the visual programming environment provided by AVS. Flow solver parameters can also be altered by programmable interactive widgets. This system partially removes the requirement of storing large solution files at frequent time steps, a characteristic of the traditional 'simulate (yields) store (yields) visualize' post-processing approach.

  13. Modified mesh-connected parallel computers

    SciTech Connect

    Carlson, D.A. )

    1988-10-01

    The mesh-connected parallel computer is an important parallel processing organization that has been used in the past for the design of supercomputing systems. In this paper, the authors explore modifications of a mesh-connected parallel computer for the purpose of increasing the efficiency of executing important application programs. These modifications are made by adding one or more global mesh structures to the processing array. They show how our modifications allow asymptotic improvements in the efficiency of executing computations having low to medium interprocessor communication requirements (e.g., tree computations, prefix computations, finding the connected components of a graph). For computations with high interprocessor communication requirements such as sorting, they show that they offer no speedup. They also compare the modified mesh-connected parallel computer to other similar organizations including the pyramid, the X-tree, and the mesh-of-trees.

  14. Finite element computation with parallel VLSI

    NASA Technical Reports Server (NTRS)

    Mcgregor, J.; Salama, M.

    1983-01-01

    This paper describes a parallel processing computer consisting of a 16-bit microcomputer as a master processor which controls and coordinates the activities of 8086/8087 VLSI chip set slave processors working in parallel. The hardware is inexpensive and can be flexibly configured and programmed to perform various functions. This makes it a useful research tool for the development of, and experimentation with parallel mathematical algorithms. Application of the hardware to computational tasks involved in the finite element analysis method is demonstrated by the generation and assembly of beam finite element stiffness matrices. A number of possible schemes for the implementation of N-elements on N- or n-processors (N is greater than n) are described, and the speedup factors of their time consumption are determined as a function of the number of available parallel processors.

  15. Parallel supercomputing today and the cedar approach.

    PubMed

    Kuck, D J; Davidson, E S; Lawrie, D H; Sameh, A H

    1986-02-28

    More and more scientists and engineers are becoming interested in using supercomputers. Earlier barriers to using these machines are disappearing as software for their use improves. Meanwhile, new parallel supercomputer architectures are emerging that may provide rapid growth in performance. These systems may use a large number of processors with an intricate memory system that is both parallel and hierarchical; they will require even more advanced software. Compilers that restructure user programs to exploit the machine organization seem to be essential. A wide range of algorithms and applications is being developed in an effort to provide high parallel processing performance in many fields. The Cedar supercomputer, presently operating with eight processors in parallel, uses advanced system and applications software developed at the University of Illinois during the past 12 years. This software should allow the number of processors in Cedar to be doubled annually, providing rapid performance advances in the next decade. PMID:17740294

  16. Parallel processing of a rotating shaft simulation

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.

    1989-01-01

    A FORTRAN program describing the vibration modes of a rotor-bearing system is analyzed for parellelism in this simulation using a Pascal-like structured language. Potential vector operations are also identified. A critical path through the simulation is identified and used in conjunction with somewhat fictitious processor characteristics to determine the time to calculate the problem on a parallel processing system having those characteristics. A parallel processing overhead time is included as a parameter for proper evaluation of the gain over serial calculation. The serial calculation time is determined for the same fictitious system. An improvement of up to 640 percent is possible depending on the value of the overhead time. Based on the analysis, certain conclusions are drawn pertaining to the development needs of parallel processing technology, and to the specification of parallel processing systems to meet computational needs.

  17. Parallel time integration software

    SciTech Connect

    2014-07-01

    This package implements an optimal-scaling multigrid solver for the (non) linear systems that arise from the discretization of problems with evolutionary behavior. Typically, solution algorithms for evolution equations are based on a time-marching approach, solving sequentially for one time step after the other. Parallelism in these traditional time-integrarion techniques is limited to spatial parallelism. However, current trends in computer architectures are leading twards system with more, but not faster. processors. Therefore, faster compute speeds must come from greater parallelism. One approach to achieve parallelism in time is with multigrid, but extending classical multigrid methods for elliptic poerators to this setting is a significant achievement. In this software, we implement a non-intrusive, optimal-scaling time-parallel method based on multigrid reduction techniques. The examples in the package demonstrate optimality of our multigrid-reduction-in-time algorithm (MGRIT) for solving a variety of parabolic equations in two and three sparial dimensions. These examples can also be used to show that MGRIT can achieve significant speedup in comparison to sequential time marching on modern architectures.

  18. Parallel time integration software

    2014-07-01

    This package implements an optimal-scaling multigrid solver for the (non) linear systems that arise from the discretization of problems with evolutionary behavior. Typically, solution algorithms for evolution equations are based on a time-marching approach, solving sequentially for one time step after the other. Parallelism in these traditional time-integrarion techniques is limited to spatial parallelism. However, current trends in computer architectures are leading twards system with more, but not faster. processors. Therefore, faster compute speeds mustmore » come from greater parallelism. One approach to achieve parallelism in time is with multigrid, but extending classical multigrid methods for elliptic poerators to this setting is a significant achievement. In this software, we implement a non-intrusive, optimal-scaling time-parallel method based on multigrid reduction techniques. The examples in the package demonstrate optimality of our multigrid-reduction-in-time algorithm (MGRIT) for solving a variety of parabolic equations in two and three sparial dimensions. These examples can also be used to show that MGRIT can achieve significant speedup in comparison to sequential time marching on modern architectures.« less

  19. Performance Evaluation in Network-Based Parallel Computing

    NASA Technical Reports Server (NTRS)

    Dezhgosha, Kamyar

    1996-01-01

    Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing SUNSPARCs' network with PVM (Parallel Virtual Machine) which is a software system for linking clusters of machines. Second, a set of three basic applications were selected. The applications consist of a parallel search, a parallel sort, a parallel matrix multiplication. These application programs were implemented in C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes in many cases is the restricting factor to performance. That is, coarse-grain parallelism which requires less frequent communication between processes will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps) which will allow us to extend our study to newer applications, performance metrics, and configurations.

  20. Parallelism in System Tools

    SciTech Connect

    Matney, Sr., Kenneth D; Shipman, Galen M

    2010-01-01

    The Cray XT, when employed in conjunction with the Lustre filesystem, has provided the ability to generate huge amounts of data in the form of many files. Typically, this is accommodated by satisfying the requests of large numbers of Lustre clients in parallel. In contrast, a single service node (Lustre client) cannot adequately service such datasets. This means that the use of traditional UNIX tools like cp, tar, et alli (with have no parallel capability) can result in substantial impact to user productivity. For example, to copy a 10 TB dataset from the service node using cp would take about 24 hours, under more or less ideal conditions. During production operation, this could easily extend to 36 hours. In this paper, we introduce the Lustre User Toolkit for Cray XT, developed at the Oak Ridge Leadership Computing Facility (OLCF). We will show that Linux commands, implementing highly parallel I/O algorithms, provide orders of magnitude greater performance, greatly reducing impact to productivity.

  1. Parallel optical sampler

    DOEpatents

    Tauke-Pedretti, Anna; Skogen, Erik J; Vawter, Gregory A

    2014-05-20

    An optical sampler includes a first and second 1.times.n optical beam splitters splitting an input optical sampling signal and an optical analog input signal into n parallel channels, respectively, a plurality of optical delay elements providing n parallel delayed input optical sampling signals, n photodiodes converting the n parallel optical analog input signals into n respective electrical output signals, and n optical modulators modulating the input optical sampling signal or the optical analog input signal by the respective electrical output signals, and providing n successive optical samples of the optical analog input signal. A plurality of output photodiodes and eADCs convert the n successive optical samples to n successive digital samples. The optical modulator may be a photodiode interconnected Mach-Zehnder Modulator. A method of sampling the optical analog input signal is disclosed.

  2. SPINning parallel systems software.

    SciTech Connect

    Matlin, O.S.; Lusk, E.; McCune, W.

    2002-03-15

    We describe our experiences in using Spin to verify parts of the Multi Purpose Daemon (MPD) parallel process management system. MPD is a distributed collection of processes connected by Unix network sockets. MPD is dynamic processes and connections among them are created and destroyed as MPD is initialized, runs user processes, recovers from faults, and terminates. This dynamic nature is easily expressible in the Spin/Promela framework but poses performance and scalability challenges. We present here the results of expressing some of the parallel algorithms of MPD and executing both simulation and verification runs with Spin.

  3. Adaptive parallel logic networks

    NASA Technical Reports Server (NTRS)

    Martinez, Tony R.; Vidal, Jacques J.

    1988-01-01

    Adaptive, self-organizing concurrent systems (ASOCS) that combine self-organization with massive parallelism for such applications as adaptive logic devices, robotics, process control, and system malfunction management, are presently discussed. In ASOCS, an adaptive network composed of many simple computing elements operating in combinational and asynchronous fashion is used and problems are specified by presenting if-then rules to the system in the form of Boolean conjunctions. During data processing, which is a different operational phase from adaptation, the network acts as a parallel hardware circuit.

  4. A parallel, portable and versatile treecode

    SciTech Connect

    Warren, M.S.; Salmon, J.K. |

    1994-10-01

    Portability and versatility are important characteristics of a computer program which is meant to be generally useful. We describe how we have developed a parallel N-body treecode to meet these goals. A variety of applications to which the code can be applied are mentioned. Performance of the program is also measured on several machines. A 512 processor Intel Paragon can solve for the forces on 10 million gravitationally interacting particles to 0.5% rms accuracy in 28.6 seconds.

  5. Parallel Total Energy

    2004-10-21

    This is a total energy electronic structure code using Local Density Approximation (LDA) of the density funtional theory. It uses the plane wave as the wave function basis set. It can sue both the norm conserving pseudopotentials and the ultra soft pseudopotentials. It can relax the atomic positions according to the total energy. It is a parallel code using MP1.

  6. Parallel Multigrid Equation Solver

    2001-09-07

    Prometheus is a fully parallel multigrid equation solver for matrices that arise in unstructured grid finite element applications. It includes a geometric and an algebraic multigrid method and has solved problems of up to 76 mullion degrees of feedom, problems in linear elasticity on the ASCI blue pacific and ASCI red machines.

  7. Optical parallel selectionist systems

    NASA Astrophysics Data System (ADS)

    Caulfield, H. John

    1993-01-01

    There are at least two major classes of computers in nature and technology: connectionist and selectionist. A subset of connectionist systems (Turing Machines) dominates modern computing, although another subset (Neural Networks) is growing rapidly. Selectionist machines have unique capabilities which should allow them to do truly creative operations. It is possible to make a parallel optical selectionist system using methods describes in this paper.

  8. Parallel fast gauss transform

    SciTech Connect

    Sampath, Rahul S; Sundar, Hari; Veerapaneni, Shravan

    2010-01-01

    We present fast adaptive parallel algorithms to compute the sum of N Gaussians at N points. Direct sequential computation of this sum would take O(N{sup 2}) time. The parallel time complexity estimates for our algorithms are O(N/n{sub p}) for uniform point distributions and O( (N/n{sub p}) log (N/n{sub p}) + n{sub p}log n{sub p}) for non-uniform distributions using n{sub p} CPUs. We incorporate a plane-wave representation of the Gaussian kernel which permits 'diagonal translation'. We use parallel octrees and a new scheme for translating the plane-waves to efficiently handle non-uniform distributions. Computing the transform to six-digit accuracy at 120 billion points took approximately 140 seconds using 4096 cores on the Jaguar supercomputer. Our implementation is 'kernel-independent' and can handle other 'Gaussian-type' kernels even when explicit analytic expression for the kernel is not known. These algorithms form a new class of core computational machinery for solving parabolic PDEs on massively parallel architectures.

  9. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (Inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

  10. Parallel hierarchical global illumination

    SciTech Connect

    Snell, Q.O.

    1997-10-08

    Solving the global illumination problem is equivalent to determining the intensity of every wavelength of light in all directions at every point in a given scene. The complexity of the problem has led researchers to use approximation methods for solving the problem on serial computers. Rather than using an approximation method, such as backward ray tracing or radiosity, the authors have chosen to solve the Rendering Equation by direct simulation of light transport from the light sources. This paper presents an algorithm that solves the Rendering Equation to any desired accuracy, and can be run in parallel on distributed memory or shared memory computer systems with excellent scaling properties. It appears superior in both speed and physical correctness to recent published methods involving bidirectional ray tracing or hybrid treatments of diffuse and specular surfaces. Like progressive radiosity methods, it dynamically refines the geometry decomposition where required, but does so without the excessive storage requirements for ray histories. The algorithm, called Photon, produces a scene which converges to the global illumination solution. This amounts to a huge task for a 1997-vintage serial computer, but using the power of a parallel supercomputer significantly reduces the time required to generate a solution. Currently, Photon can be run on most parallel environments from a shared memory multiprocessor to a parallel supercomputer, as well as on clusters of heterogeneous workstations.

  11. Parallel hierarchical radiosity rendering

    SciTech Connect

    Carter, M.

    1993-07-01

    In this dissertation, the step-by-step development of a scalable parallel hierarchical radiosity renderer is documented. First, a new look is taken at the traditional radiosity equation, and a new form is presented in which the matrix of linear system coefficients is transformed into a symmetric matrix, thereby simplifying the problem and enabling a new solution technique to be applied. Next, the state-of-the-art hierarchical radiosity methods are examined for their suitability to parallel implementation, and scalability. Significant enhancements are also discovered which both improve their theoretical foundations and improve the images they generate. The resultant hierarchical radiosity algorithm is then examined for sources of parallelism, and for an architectural mapping. Several architectural mappings are discussed. A few key algorithmic changes are suggested during the process of making the algorithm parallel. Next, the performance, efficiency, and scalability of the algorithm are analyzed. The dissertation closes with a discussion of several ideas which have the potential to further enhance the hierarchical radiosity method, or provide an entirely new forum for the application of hierarchical methods.

  12. Parallel Dislocation Simulator

    2006-10-30

    ParaDiS is software capable of simulating the motion, evolution, and interaction of dislocation networks in single crystals using massively parallel computer architectures. The software is capable of outputting the stress-strain response of a single crystal whose plastic deformation is controlled by the dislocation processes.

  13. A note on parallel efficiency of fire simulation on cluster

    NASA Astrophysics Data System (ADS)

    Valasek, L.; Glasa, J.

    2016-08-01

    Current HPC clusters are capable to reduce execution time of parallelized tasks significantly. The paper discusses the use of two selected strategies of cluster computational resources allocation and their impact on parallel efficiency of fire simulation. Simulation of a simple corridor fire scenario by Fire Dynamics Simulator parallelized by the MPI programming model is tested on the HPC cluster at the Institute of Informatics of Slovak Academy of Sciences in Bratislava (Slovakia). The tests confirm that parallelization has a great potential to reduce execution times achieving promising values of parallel efficiency of the simulation, however, the results also show that the use of increasing numbers of computational meshes resulting in increasing numbers of used computational cores does not necessarily decrease the execution time nor the parallel efficiency of simulation. The results obtained indicate that the simulation achieves different values of the execution time and the parallel efficiency in regard of the used strategy for cluster computational resources allocation.

  14. High Performance Parallel Computational Nanotechnology

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Craw, James M. (Technical Monitor)

    1995-01-01

    At a recent press conference, NASA Administrator Dan Goldin encouraged NASA Ames Research Center to take a lead role in promoting research and development of advanced, high-performance computer technology, including nanotechnology. Manufacturers of leading-edge microprocessors currently perform large-scale simulations in the design and verification of semiconductor devices and microprocessors. Recently, the need for this intensive simulation and modeling analysis has greatly increased, due in part to the ever-increasing complexity of these devices, as well as the lessons of experiences such as the Pentium fiasco. Simulation, modeling, testing, and validation will be even more important for designing molecular computers because of the complex specification of millions of atoms, thousands of assembly steps, as well as the simulation and modeling needed to ensure reliable, robust and efficient fabrication of the molecular devices. The software for this capacity does not exist today, but it can be extrapolated from the software currently used in molecular modeling for other applications: semi-empirical methods, ab initio methods, self-consistent field methods, Hartree-Fock methods, molecular mechanics; and simulation methods for diamondoid structures. In as much as it seems clear that the application of such methods in nanotechnology will require powerful, highly powerful systems, this talk will discuss techniques and issues for performing these types of computations on parallel systems. We will describe system design issues (memory, I/O, mass storage, operating system requirements, special user interface issues, interconnects, bandwidths, and programming languages) involved in parallel methods for scalable classical, semiclassical, quantum, molecular mechanics, and continuum models; molecular nanotechnology computer-aided designs (NanoCAD) techniques; visualization using virtual reality techniques of structural models and assembly sequences; software required to

  15. An Expert System for the Development of Efficient Parallel Code

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.

  16. Chare kernel; A runtime support system for parallel computations

    SciTech Connect

    Shu, W. ); Kale, L.V. )

    1991-03-01

    This paper presents the chare kernel system, which supports parallel computations with irregular structure. The chare kernel is a collection of primitive functions that manage chares, manipulative messages, invoke atomic computations, and coordinate concurrent activities. Programs written in the chare kernel language can be executed on different parallel machines without change. Users writing such programs concern themselves with the creation of parallel actions but not with assigning them to specific processors. The authors describe the design and implementation of the chare kernel. Performance of chare kernel programs on two hypercube machines, the Intel iPSC/2 and the NCUBE, is also given.

  17. A Parallel Algorithm for the Vehicle Routing Problem

    SciTech Connect

    Groer, Christopher S; Golden, Bruce; Edward, Wasil

    2011-01-01

    The vehicle routing problem (VRP) is a dicult and well-studied combinatorial optimization problem. We develop a parallel algorithm for the VRP that combines a heuristic local search improvement procedure with integer programming. We run our parallel algorithm with as many as 129 processors and are able to quickly nd high-quality solutions to standard benchmark problems. We assess the impact of parallelism by analyzing our procedure's performance under a number of dierent scenarios.

  18. Parallel Molecular Dynamics Stencil : a new parallel computing environment for a large-scale molecular dynamics simulation of solids

    NASA Astrophysics Data System (ADS)

    Shimizu, Futoshi; Kimizuka, Hajime; Kaburaki, Hideo

    2002-08-01

    A new parallel computing environment, called as ``Parallel Molecular Dynamics Stencil'', has been developed to carry out a large-scale short-range molecular dynamics simulation of solids. The stencil is written in C language using MPI for parallelization and designed successfully to separate and conceal parts of the programs describing cutoff schemes and parallel algorithms for data communication. This has been made possible by introducing the concept of image atoms. Therefore, only a sequential programming of the force calculation routine is required for executing the stencil in parallel environment. Typical molecular dynamics routines, such as various ensembles, time integration methods, and empirical potentials, have been implemented in the stencil. In the presentation, the performance of the stencil on parallel computers of Hitachi, IBM, SGI, and PC-cluster using the models of Lennard-Jones and the EAM type potentials for fracture problem will be reported.

  19. Parallel computers and parallel algorithms for CFD: An introduction

    NASA Astrophysics Data System (ADS)

    Roose, Dirk; Vandriessche, Rafael

    1995-10-01

    This text presents a tutorial on those aspects of parallel computing that are important for the development of efficient parallel algorithms and software for computational fluid dynamics. We first review the main architectural features of parallel computers and we briefly describe some parallel systems on the market today. We introduce some important concepts concerning the development and the performance evaluation of parallel algorithms. We discuss how work load imbalance and communication costs on distributed memory parallel computers can be minimized. We present performance results for some CFD test cases. We focus on applications using structured and block structured grids, but the concepts and techniques are also valid for unstructured grids.

  20. Parallel computing using a Lagrangian formulation

    NASA Technical Reports Server (NTRS)

    Liou, May-Fun; Loh, Ching Yuen

    1991-01-01

    A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on a 8192-processor CM-2 over a single processor of a CRAY-2.

  1. Hypercluster parallel processing library user's manual

    NASA Technical Reports Server (NTRS)

    Quealy, Angela

    1990-01-01

    This User's Manual describes the Hypercluster Parallel Processing Library, composed of FORTRAN-callable subroutines which enable a FORTRAN programmer to manipulate and transfer information throughout the Hypercluster at NASA Lewis Research Center. Each subroutine and its parameters are described in detail. A simple heat flow application using Laplace's equation is included to demonstrate the use of some of the library's subroutines. The manual can be used initially as an introduction to the parallel features provided by the library. Thereafter it can be used as a reference when programming an application.

  2. Berkeley Unified Parallel C (UPC) Compiler

    2003-04-06

    This program is a portable, open-source, compiler for the UPC language, which is based on the Open64 framework, and has extensive support for optimizations. This compiler operated by translating UPC into ANS/ISO C for compilation by a native compiler and linking with a UPC Runtime Library. This design eases portability to both shared and distributed memory parallel architectures. For proper operation the "Berkeley Unified Parallel C (UPC) Runtime Library" and its dependencies are required. Compatiblemore » replacements which implement "The Berkeley UPC Runtime Specification" are possible.« less

  3. Temporal fringe pattern analysis with parallel computing

    SciTech Connect

    Tuck Wah Ng; Kar Tien Ang; Argentini, Gianluca

    2005-11-20

    Temporal fringe pattern analysis is invaluable in transient phenomena studies but necessitates long processing times. Here we describe a parallel computing strategy based on the single-program multiple-data model and hyperthreading processor technology to reduce the execution time. In a two-node cluster workstation configuration we found that execution periods were reduced by 1.6 times when four virtual processors were used. To allow even lower execution times with an increasing number of processors, the time allocated for data transfer, data read, and waiting should be minimized. Parallel computing is found here to present a feasible approach to reduce execution times in temporal fringe pattern analysis.

  4. A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture

    NASA Technical Reports Server (NTRS)

    Imbriale, W. A.; Cwik, T.

    1994-01-01

    A reflector antenna computer program based upon a simple discreet approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-gain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.

  5. Fast Inverse Distance Weighting-Based Spatiotemporal Interpolation: A Web-Based Application of Interpolating Daily Fine Particulate Matter PM2.5 in the Contiguous U.S. Using Parallel Programming and k-d Tree

    PubMed Central

    Li, Lixin; Losser, Travis; Yorke, Charles; Piltner, Reinhard

    2014-01-01

    Epidemiological studies have identified associations between mortality and changes in concentration of particulate matter. These studies have highlighted the public concerns about health effects of particulate air pollution. Modeling fine particulate matter PM2.5 exposure risk and monitoring day-to-day changes in PM2.5 concentration is a critical step for understanding the pollution problem and embarking on the necessary remedy. This research designs, implements and compares two inverse distance weighting (IDW)-based spatiotemporal interpolation methods, in order to assess the trend of daily PM2.5 concentration for the contiguous United States over the year of 2009, at both the census block group level and county level. Traditionally, when handling spatiotemporal interpolation, researchers tend to treat space and time separately and reduce the spatiotemporal interpolation problems to a sequence of snapshots of spatial interpolations. In this paper, PM2.5 data interpolation is conducted in the continuous space-time domain by integrating space and time simultaneously, using the so-called extension approach. Time values are calculated with the help of a factor under the assumption that spatial and temporal dimensions are equally important when interpolating a continuous changing phenomenon in the space-time domain. Various IDW-based spatiotemporal interpolation methods with different parameter configurations are evaluated by cross-validation. In addition, this study explores computational issues (computer processing speed) faced during implementation of spatiotemporal interpolation for huge data sets. Parallel programming techniques and an advanced data structure, named k-d tree, are adapted in this paper to address the computational challenges. Significant computational improvement has been achieved. Finally, a web-based spatiotemporal IDW-based interpolation application is designed and implemented where users can visualize and animate spatiotemporal interpolation

  6. Parallel Consensual Neural Networks

    NASA Technical Reports Server (NTRS)

    Benediktsson, J. A.; Sveinsson, J. R.; Ersoy, O. K.; Swain, P. H.

    1993-01-01

    A new neural network architecture is proposed and applied in classification of remote sensing/geographic data from multiple sources. The new architecture is called the parallel consensual neural network and its relation to hierarchical and ensemble neural networks is discussed. The parallel consensual neural network architecture is based on statistical consensus theory. The input data are transformed several times and the different transformed data are applied as if they were independent inputs and are classified using stage neural networks. Finally, the outputs from the stage networks are then weighted and combined to make a decision. Experimental results based on remote sensing data and geographic data are given. The performance of the consensual neural network architecture is compared to that of a two-layer (one hidden layer) conjugate-gradient backpropagation neural network. The results with the proposed neural network architecture compare favorably in terms of classification accuracy to the backpropagation method.

  7. Parallel Subconvolution Filtering Architectures

    NASA Technical Reports Server (NTRS)

    Gray, Andrew A.

    2003-01-01

    These architectures are based on methods of vector processing and the discrete-Fourier-transform/inverse-discrete- Fourier-transform (DFT-IDFT) overlap-and-save method, combined with time-block separation of digital filters into frequency-domain subfilters implemented by use of sub-convolutions. The parallel-processing method implemented in these architectures enables the use of relatively small DFT-IDFT pairs, while filter tap lengths are theoretically unlimited. The size of a DFT-IDFT pair is determined by the desired reduction in processing rate, rather than on the order of the filter that one seeks to implement. The emphasis in this report is on those aspects of the underlying theory and design rules that promote computational efficiency, parallel processing at reduced data rates, and simplification of the designs of very-large-scale integrated (VLSI) circuits needed to implement high-order filters and correlators.

  8. Parallel grid population

    SciTech Connect

    Wald, Ingo; Ize, Santiago

    2015-07-28

    Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.

  9. Parallel Anisotropic Tetrahedral Adaptation

    NASA Technical Reports Server (NTRS)

    Park, Michael A.; Darmofal, David L.

    2008-01-01

    An adaptive method that robustly produces high aspect ratio tetrahedra to a general 3D metric specification without introducing hybrid semi-structured regions is presented. The elemental operators and higher-level logic is described with their respective domain-decomposed parallelizations. An anisotropic tetrahedral grid adaptation scheme is demonstrated for 1000-1 stretching for a simple cube geometry. This form of adaptation is applicable to more complex domain boundaries via a cut-cell approach as demonstrated by a parallel 3D supersonic simulation of a complex fighter aircraft. To avoid the assumptions and approximations required to form a metric to specify adaptation, an approach is introduced that directly evaluates interpolation error. The grid is adapted to reduce and equidistribute this interpolation error calculation without the use of an intervening anisotropic metric. Direct interpolation error adaptation is illustrated for 1D and 3D domains.

  10. Parallel multilevel preconditioners

    SciTech Connect

    Bramble, J.H.; Pasciak, J.E.; Xu, Jinchao.

    1989-01-01

    In this paper, we shall report on some techniques for the development of preconditioners for the discrete systems which arise in the approximation of solutions to elliptic boundary value problems. Here we shall only state the resulting theorems. It has been demonstrated that preconditioned iteration techniques often lead to the most computationally effective algorithms for the solution of the large algebraic systems corresponding to boundary value problems in two and three dimensional Euclidean space. The use of preconditioned iteration will become even more important on computers with parallel architecture. This paper discusses an approach for developing completely parallel multilevel preconditioners. In order to illustrate the resulting algorithms, we shall describe the simplest application of the technique to a model elliptic problem.

  11. Ultrascalable petaflop parallel supercomputer

    DOEpatents

    Blumrich, Matthias A.; Chen, Dong; Chiu, George; Cipolla, Thomas M.; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Hall, Shawn; Haring, Rudolf A.; Heidelberger, Philip; Kopcsay, Gerard V.; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan; Takken, Todd

    2010-07-20

    A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.

  12. PCLIPS: Parallel CLIPS

    NASA Technical Reports Server (NTRS)

    Gryphon, Coranth D.; Miller, Mark D.

    1991-01-01

    PCLIPS (Parallel CLIPS) is a set of extensions to the C Language Integrated Production System (CLIPS) expert system language. PCLIPS is intended to provide an environment for the development of more complex, extensive expert systems. Multiple CLIPS expert systems are now capable of running simultaneously on separate processors, or separate machines, thus dramatically increasing the scope of solvable tasks within the expert systems. As a tool for parallel processing, PCLIPS allows for an expert system to add to its fact-base information generated by other expert systems, thus allowing systems to assist each other in solving a complex problem. This allows individual expert systems to be more compact and efficient, and thus run faster or on smaller machines.

  13. Homology, convergence and parallelism.

    PubMed

    Ghiselin, Michael T

    2016-01-01

    Homology is a relation of correspondence between parts of parts of larger wholes. It is used when tracking objects of interest through space and time and in the context of explanatory historical narratives. Homologues can be traced through a genealogical nexus back to a common ancestral precursor. Homology being a transitive relation, homologues remain homologous however much they may come to differ. Analogy is a relationship of correspondence between parts of members of classes having no relationship of common ancestry. Although homology is often treated as an alternative to convergence, the latter is not a kind of correspondence: rather, it is one of a class of processes that also includes divergence and parallelism. These often give rise to misleading appearances (homoplasies). Parallelism can be particularly hard to detect, especially when not accompanied by divergences in some parts of the body. PMID:26598721

  14. Collisionless parallel shocks

    NASA Technical Reports Server (NTRS)

    Khabibrakhmanov, I. KH.; Galeev, A. A.; Galinskii, V. L.

    1993-01-01

    Consideration is given to a collisionless parallel shock based on solitary-type solutions of the modified derivative nonlinear Schroedinger equation (MDNLS) for parallel Alfven waves. The standard derivative nonlinear Schroedinger equation is generalized in order to include the possible anisotropy of the plasma distribution and higher-order Korteweg-de Vies-type dispersion. Stationary solutions of MDNLS are discussed. The anisotropic nature of 'adiabatic' reflections leads to the asymmetric particle distribution in the upstream as well as in the downstream regions of the shock. As a result, nonzero heat flux appears near the front of the shock. It is shown that this causes the stochastic behavior of the nonlinear waves, which can significantly contribute to the shock thermalization.

  15. Requirements for supercomputing in energy research: The transition to massively parallel computing

    SciTech Connect

    Not Available

    1993-02-01

    This report discusses: The emergence of a practical path to TeraFlop computing and beyond; requirements of energy research programs at DOE; implementation: supercomputer production computing environment on massively parallel computers; and implementation: user transition to massively parallel computing.

  16. The 2nd Symposium on the Frontiers of Massively Parallel Computations

    NASA Technical Reports Server (NTRS)

    Mills, Ronnie (Editor)

    1988-01-01

    Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.

  17. ASSEMBLY OF PARALLEL PLATES

    DOEpatents

    Groh, E.F.; Lennox, D.H.

    1963-04-23

    This invention is concerned with a rigid assembly of parallel plates in which keyways are stamped out along the edges of the plates and a self-retaining key is inserted into aligned keyways. Spacers having similar keyways are included between adjacent plates. The entire assembly is locked into a rigid structure by fastening only the outermost plates to the ends of the keys. (AEC)

  18. Parallel sphere rendering

    SciTech Connect

    Krogh, M.; Painter, J.; Hansen, C.

    1996-10-01

    Sphere rendering is an important method for visualizing molecular dynamics data. This paper presents a parallel algorithm that is almost 90 times faster than current graphics workstations. To render extremely large data sets and large images, the algorithm uses the MIMD features of the supercomputers to divide up the data, render independent partial images, and then finally composite the multiple partial images using an optimal method. The algorithm and performance results are presented for the CM-5 and the M.

  19. Xyce parallel electronic simulator.

    SciTech Connect

    Keiter, Eric R; Mei, Ting; Russo, Thomas V.; Rankin, Eric Lamont; Schiek, Richard Louis; Thornquist, Heidi K.; Fixel, Deborah A.; Coffey, Todd S; Pawlowski, Roger P; Santarelli, Keith R.

    2010-05-01

    This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users Guide. The focus of this document is (to the extent possible) exhaustively list device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users Guide.

  20. “Everybody Brush!”: Protocol for a Parallel-Group Randomized Controlled Trial of a Family-Focused Primary Prevention Program With Distribution of Oral Hygiene Products and Education to Increase Frequency of Toothbrushing

    PubMed Central

    2015-01-01

    Background Twice daily toothbrushing with fluoridated toothpaste is the most widely advocated preventive strategy for dental caries (tooth decay) and is recommended by professional dental associations. Not all parents, children, or adolescents follow this recommendation. This protocol describes the methods for the implementation and evaluation of a quality improvement health promotion program. Objective The objective of the study is to show a theory-informed, evidence-based program to improve twice daily toothbrushing and oral health-related quality of life that may reduce dental caries, dental treatment need, and costs. Methods The design is a parallel-group, pragmatic randomized controlled trial. Families of Medicaid-insured children and adolescents within a large dental care organization in central Oregon will participate in the trial (n=21,743). Families will be assigned to one of three groups: a test intervention, an active control, or a passive control condition. The intervention aims to address barriers and support for twice-daily toothbrushing. Families in the test condition will receive toothpaste and toothbrushes by mail for all family members every three months. In addition, they will receive education and social support to encourage toothbrushing via postcards, recorded telephone messages, and an optional participant-initiated telephone helpline. Families in the active control condition will receive the kit of supplies by mail, but no additional instructional information or telephone support. Families assigned to the passive control will be on a waiting list. The primary outcomes are restorative dental care received and, only for children younger than 36 months old at baseline, the frequency of twice-daily toothbrushing. Data will be collected through dental claims records and, for children younger than 36 months old at baseline, parent interviews and clinical exams. Results Enrollment of participants and baseline interviews have been completed. Final

  1. Parallel algorithms for optical digital computers

    SciTech Connect

    Huang, A.

    1983-01-01

    Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterised by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognising a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed. 12 references.

  2. Loop parallelism on Tera MTA using SISAL

    SciTech Connect

    Mitrovic, S.

    1995-11-01

    The difficulty of programming parallel computers has impeded their wide-spread use. The problems are caused by existing hardware and software tools. The software problems on shared-memory and vector computers can be solved by using deterministic high-performance functional languages like SISAL. Distributed-memory computers have even more obstacles than shared-memory parallel machines. Research indicates that multithreaded architectures can hide long latency of distributed memories and that they can solve the problems of locality. Tera`s MTA multiprocessor is based on the concept of multithreading and provides the programmer with a real shared-memory model. This paper investigates the performance of parallel loops written in SISAL and executed on the Tera MTA using the Livermore Loops benchmarks.

  3. Numerical wind tunnel and parallel FORTRAN

    NASA Astrophysics Data System (ADS)

    Nakamura, Takashi; Yoshida, Masahiro; Fukuda, Masahiro; Takamura, Moriyuki; Okada, Shin

    1992-12-01

    Computational Fluid Dynamics (CFD) requires computers 100 times faster than the Fujitsu VP400 in effective speed. Such a processor can be suitably called the 'Numerical Wind Tunnel'. Numerical Wind Tunnel (NWT) is a parallel computer system of a distributed memory architecture composed of vector processors connected through cross-bar network. In this report, the system configuration, processing element, and interconnection network and communication mechanism of the NWT are shown. Fundamental functions global data, parallel execution of DO-loop, and data decomposition and allocation, which the language-processor system has to provide in order to realize parallel execution on the NWT are also shown. FORTRAN 77 is chosen as a basic programming language for NWT and some compiler directives are added to make effective use of the NWT.

  4. QCD on the Massively Parallel Computer AP1000

    NASA Astrophysics Data System (ADS)

    Akemi, K.; Fujisaki, M.; Okuda, M.; Tago, Y.; Hashimoto, T.; Hioki, S.; Miyamura, O.; Takaishi, T.; Nakamura, A.; de Forcrand, Ph.; Hege, C.; Stamatescu, I. O.

    We present the QCD-TARO program of calculations which uses the parallel computer AP1000 of Fujitsu. We discuss the results on scaling, correlation times and hadronic spectrum, some aspects of the implementation and the future prospects.

  5. The EMCC / DARPA Massively Parallel Electromagnetic Scattering Project

    NASA Technical Reports Server (NTRS)

    Woo, Alex C.; Hill, Kueichien C.

    1996-01-01

    The Electromagnetic Code Consortium (EMCC) was sponsored by the Advanced Research Program Agency (ARPA) to demonstrate the effectiveness of massively parallel computing in large scale radar signature predictions. The EMCC/ARPA project consisted of three parts.

  6. Trajectory optimization using parallel shooting method on parallel computer

    SciTech Connect

    Wirthman, D.J.; Park, S.Y.; Vadali, S.R.

    1995-03-01

    The efficiency of a parallel shooting method on a parallel computer for solving a variety of optimal control guidance problems is studied. Several examples are considered to demonstrate that a speedup of nearly 7 to 1 is achieved with the use of 16 processors. It is suggested that further improvements in performance can be achieved by parallelizing in the state domain. 10 refs.

  7. Parallel Computing for Probabilistic Response Analysis of High Temperature Composites

    NASA Technical Reports Server (NTRS)

    Sues, R. H.; Lua, Y. J.; Smith, M. D.

    1994-01-01

    The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit these parallelisms. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.

  8. Analysis and research on the balanced distribution of the network-based data in parallel database

    NASA Astrophysics Data System (ADS)

    He, JunHua

    2011-12-01

    The rapid development of parallel computer systems, making parallel operating environment gradually mature and widely used in scientific computing and research in many fields, thus parallel database of research becomes more and more attention and research has become an important database field of study. This network-based parallel cluster of characteristics and the current parallel computer system new trends, analyzes the network parallel clusters of workstations, parallel database data skew problem in data distribution characteristics of the environment is proposed with the ability to adapt to the dynamic data balanced distribution programs.

  9. A massively parallel memory-based story system for psychotherapy.

    PubMed

    Smith, R N; Chen, C C; Feng, F F; Gomez-Gauchia, H

    1993-10-01

    We describe a memory-based system for psychotherapy, Dr. Bob, built to run on the data parallel processor Thinking Machines, Inc., CM-2a Connection Machine. The system retrieves, in parallel, stories of alcohol addiction and sexual abuse which can be used by psychiatrists in working with their patients as part of their work in recovering from addictive behavior and psychological trauma. The program is written in *LISP (pronounced Star LISP), a version of LISP used in programming Connection Machines. PMID:8243067

  10. Parallel search of strongly ordered game trees

    SciTech Connect

    Marsland, T.A.; Campbell, M.

    1982-12-01

    The alpha-beta algorithm forms the basis of many programs that search game trees. A number of methods have been designed to improve the utility of the sequential version of this algorithm, especially for use in game-playing programs. These enhancements are based on the observation that alpha beta is most effective when the best move in each position is considered early in the search. Trees that have this so-called strong ordering property are not only of practical importance but possess characteristics that can be exploited in both sequential and parallel environments. This paper draws upon experiences gained during the development of programs which search chess game trees. Over the past decade major enhancements of the alpha beta algorithm have been developed by people building game-playing programs, and many of these methods will be surveyed and compared here. The balance of the paper contains a study of contemporary methods for searching chess game trees in parallel, using an arbitrary number of independent processors. To make efficient use of these processors, one must have a clear understanding of the basic properties of the trees actually traversed when alpha-beta cutoffs occur. This paper provides such insights and concludes with a brief description of a refinement to a standard parallel search algorithm for this problem. 33 references.

  11. Resistor Combinations for Parallel Circuits.

    ERIC Educational Resources Information Center

    McTernan, James P.

    1978-01-01

    To help simplify both teaching and learning of parallel circuits, a high school electricity/electronics teacher presents and illustrates the use of tables of values for parallel resistive circuits in which total resistances are whole numbers. (MF)

  12. The Galley Parallel File System

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David

    1996-01-01

    As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. The interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. We discuss Galley's file structure and application interface, as well as an application that has been implemented using that interface.

  13. Asynchronous interpretation of parallel microprograms

    SciTech Connect

    Bandman, O.L.

    1984-03-01

    In this article, the authors demonstrate how to pass from a given synchronous interpretation of a parallel microprogram to an equivalent asynchronous interpretation, and investigate the cost associated with the rejection of external synchronization in parallel microprogram structures.

  14. Linear Bregman algorithm implemented in parallel GPU

    NASA Astrophysics Data System (ADS)

    Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping

    2015-08-01

    At present, most compressed sensing (CS) algorithms have poor converging speed, thus are difficult to run on PC. To deal with this issue, we use a parallel GPU, to implement a broadly used compressed sensing algorithm, the Linear Bregman algorithm. Linear iterative Bregman algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, the linear Bregman algorithm only involves the vector and matrix multiplication and thresholding operation, and is simpler and more efficient for programming. We use C as a development language and adopt CUDA (Compute Unified Device Architecture) as parallel computing architectures. In this paper, we compared the parallel Bregman algorithm with traditional CPU realized Bregaman algorithm. In addition, we also compared the parallel Bregman algorithm with other CS reconstruction algorithms, such as OMP and TwIST algorithms. Compared with these two algorithms, the result of this paper shows that, the parallel Bregman algorithm needs shorter time, and thus is more convenient for real-time object reconstruction, which is important to people's fast growing demand to information technology.

  15. Parallelized nested sampling

    NASA Astrophysics Data System (ADS)

    Henderson, R. Wesley; Goggans, Paul M.

    2014-12-01

    One of the important advantages of nested sampling as an MCMC technique is its ability to draw representative samples from multimodal distributions and distributions with other degeneracies. This coverage is accomplished by maintaining a number of so-called live samples within a likelihood constraint. In usual practice, at each step, only the sample with the least likelihood is discarded from this set of live samples and replaced. In [1], Skilling shows that for a given number of live samples, discarding only one sample yields the highest precision in estimation of the log-evidence. However, if we increase the number of live samples, more samples can be discarded at once while still maintaining the same precision. For computer code running only serially, this modification would considerably increase the wall clock time necessary to reach convergence. However, if we use a computer with parallel processing capabilities, and we write our code to take advantage of this parallelism to replace multiple samples concurrently, the performance penalty can be eliminated entirely and possibly reversed. In this case, we must use the more general equation in [1] for computing the expectation of the shrinkage distribution: E [- log t]= (N r-r+1)-1+(Nr-r+2)-1+⋯+Nr-1, for shrinkage t with Nr live samples and r samples discarded at each iteration. The equation for the variance Var (- log t)= (N r-r+1)-2+(Nr-r+2)-2+⋯+Nr-2 is used to find the appropriate number of live samples Nr to use with r > 1 to match the variance achieved with N1 live samples and r = 1. In this paper, we show that by replacing multiple discarded samples in parallel, we are able to achieve a more thorough sampling of the constrained prior distribution, reduce runtime, and increase precision.

  16. Parallel and Portable Monte Carlo Particle Transport

    NASA Astrophysics Data System (ADS)

    Lee, S. R.; Cummings, J. C.; Nolen, S. D.; Keen, N. D.

    1997-08-01

    We have developed a multi-group, Monte Carlo neutron transport code in C++ using object-oriented methods and the Parallel Object-Oriented Methods and Applications (POOMA) class library. This transport code, called MC++, currently computes k and α eigenvalues of the neutron transport equation on a rectilinear computational mesh. It is portable to and runs in parallel on a wide variety of platforms, including MPPs, clustered SMPs, and individual workstations. It contains appropriate classes and abstractions for particle transport and, through the use of POOMA, for portable parallelism. Current capabilities are discussed, along with physics and performance results for several test problems on a variety of hardware, including all three Accelerated Strategic Computing Initiative (ASCI) platforms. Current parallel performance indicates the ability to compute α-eigenvalues in seconds or minutes rather than days or weeks. Current and future work on the implementation of a general transport physics framework (TPF) is also described. This TPF employs modern C++ programming techniques to provide simplified user interfaces, generic STL-style programming, and compile-time performance optimization. Physics capabilities of the TPF will be extended to include continuous energy treatments, implicit Monte Carlo algorithms, and a variety of convergence acceleration techniques such as importance combing.

  17. Parallel sphere rendering

    SciTech Connect

    Krogh, M.; Hansen, C.; Painter, J.; de Verdiere, G.C.

    1995-05-01

    Sphere rendering is an important method for visualizing molecular dynamics data. This paper presents a parallel divide-and-conquer algorithm that is almost 90 times faster than current graphics workstations. To render extremely large data sets and large images, the algorithm uses the MIMD features of the supercomputers to divide up the data, render independent partial images, and then finally composite the multiple partial images using an optimal method. The algorithm and performance results are presented for the CM-5 and the T3D.

  18. Highly parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.; Tichy, Walter F.

    1990-01-01

    Highly parallel computing architectures are the only means to achieve the computation rates demanded by advanced scientific problems. A decade of research has demonstrated the feasibility of such machines and current research focuses on which architectures designated as multiple instruction multiple datastream (MIMD) and single instruction multiple datastream (SIMD) have produced the best results to date; neither shows a decisive advantage for most near-homogeneous scientific problems. For scientific problems with many dissimilar parts, more speculative architectures such as neural networks or data flow may be needed.

  19. Parallel Kinematic Machines (PKM)

    SciTech Connect

    Henry, R.S.

    2000-03-17

    The purpose of this 3-year cooperative research project was to develop a parallel kinematic machining (PKM) capability for complex parts that normally require expensive multiple setups on conventional orthogonal machine tools. This non-conventional, non-orthogonal machining approach is based on a 6-axis positioning system commonly referred to as a hexapod. Sandia National Laboratories/New Mexico (SNL/NM) was the lead site responsible for a multitude of projects that defined the machining parameters and detailed the metrology of the hexapod. The role of the Kansas City Plant (KCP) in this project was limited to evaluating the application of this unique technology to production applications.

  20. Parallel image computation in clusters with task-distributor.

    PubMed

    Baun, Christian

    2016-01-01

    Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel for speeding up the image computation. Because ray tracing is a computational expensive and memory consuming task, ray tracing can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). Task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation, in contrast to various other projects. By analyzing the runtime of the sequential and parallel program parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application.

  1. Parallel image computation in clusters with task-distributor.

    PubMed

    Baun, Christian

    2016-01-01

    Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel for speeding up the image computation. Because ray tracing is a computational expensive and memory consuming task, ray tracing can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). Task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation, in contrast to various other projects. By analyzing the runtime of the sequential and parallel program parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application. PMID:27330898

  2. CSM parallel structural methods research

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.

    1989-01-01

    Parallel structural methods, research team activities, advanced architecture computers for parallel computational structural mechanics (CSM) research, the FLEX/32 multicomputer, a parallel structural analyses testbed, blade-stiffened aluminum panel with a circular cutout and the dynamic characteristics of a 60 meter, 54-bay, 3-longeron deployable truss beam are among the topics discussed.

  3. Roo: A parallel theorem prover

    SciTech Connect

    Lusk, E.L.; McCune, W.W.; Slaney, J.K.

    1991-11-01

    We describe a parallel theorem prover based on the Argonne theorem-proving system OTTER. The parallel system, called Roo, runs on shared-memory multiprocessors such as the Sequent Symmetry. We explain the parallel algorithm used and give performance results that demonstrate near-linear speedups on large problems.

  4. Review of parallel computing methods and tools for FPGA technology

    NASA Astrophysics Data System (ADS)

    Cieszewski, Radosław; Linczuk, Maciej; Pozniak, Krzysztof; Romaniuk, Ryszard

    2013-10-01

    Parallel computing is emerging as an important area of research in computer architectures and software systems. Many algorithms can be greatly accelerated using parallel computing techniques. Specialized parallel computer architectures are used for accelerating speci c tasks. High-Energy Physics Experiments measuring systems often use FPGAs for ne-grained computation. FPGA combines many bene ts of both software and ASIC implementations. Like software, the mapped circuit is exible, and can be recon gured over the lifetime of the system. FPGAs therefore have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute operations of traditional processors, and possibly exploiting a greater level of parallelism. Creating parallel programs implemented in FPGAs is not trivial. This paper presents existing methods and tools for ne-grained computation implemented in FPGA using Behavioral Description and High Level Programming Languages.

  5. Parallel community climate model: Description and user`s guide

    SciTech Connect

    Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H.

    1996-07-15

    This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.

  6. Parallels and Divergences?

    ERIC Educational Resources Information Center

    Spray, Martin

    1985-01-01

    Discusses the varying philosophical viewpoints and program orientations associated with the conservation movement, assessing the implications of these divergences on the objectives and instructional methods of environmental education. Also identifies and explains the range of differences existing in environmental education programs. (ML)

  7. Parallel ptychographic reconstruction

    PubMed Central

    Nashed, Youssef S. G.; Vine, David J.; Peterka, Tom; Deng, Junjing; Ross, Rob; Jacobsen, Chris

    2014-01-01

    Ptychography is an imaging method whereby a coherent beam is scanned across an object, and an image is obtained by iterative phasing of the set of diffraction patterns. It is able to be used to image extended objects at a resolution limited by scattering strength of the object and detector geometry, rather than at an optics-imposed limit. As technical advances allow larger fields to be imaged, computational challenges arise for reconstructing the correspondingly larger data volumes, yet at the same time there is also a need to deliver reconstructed images immediately so that one can evaluate the next steps to take in an experiment. Here we present a parallel method for real-time ptychographic phase retrieval. It uses a hybrid parallel strategy to divide the computation between multiple graphics processing units (GPUs) and then employs novel techniques to merge sub-datasets into a single complex phase and amplitude image. Results are shown on a simulated specimen and a real dataset from an X-ray experiment conducted at a synchrotron light source. PMID:25607174

  8. Making parallel lines meet

    PubMed Central

    Baskin, Tobias I.; Gu, Ying

    2012-01-01

    The extracellular matrix is constructed beyond the plasma membrane, challenging mechanisms for its control by the cell. In plants, the cell wall is highly ordered, with cellulose microfibrils aligned coherently over a scale spanning hundreds of cells. To a considerable extent, deploying aligned microfibrils determines mechanical properties of the cell wall, including strength and compliance. Cellulose microfibrils have long been seen to be aligned in parallel with an array of microtubules in the cell cortex. How do these cortical microtubules affect the cellulose synthase complex? This question has stood for as many years as the parallelism between the elements has been observed, but now an answer is emerging. Here, we review recent work establishing that the link between microtubules and microfibrils is mediated by a protein named cellulose synthase-interacting protein 1 (CSI1). The protein binds both microtubules and components of the cellulose synthase complex. In the absence of CSI1, microfibrils are synthesized but their alignment becomes uncoupled from the microtubules, an effect that is phenocopied in the wild type by depolymerizing the microtubules. The characterization of CSI1 significantly enhances knowledge of how cellulose is aligned, a process that serves as a paradigmatic example of how cells dictate the construction of their extracellular environment. PMID:22902763

  9. Applied Parallel Metadata Indexing

    SciTech Connect

    Jacobi, Michael R

    2012-08-01

    The GPFS Archive is parallel archive is a parallel archive used by hundreds of users in the Turquoise collaboration network. It houses 4+ petabytes of data in more than 170 million files. Currently, users must navigate the file system to retrieve their data, requiring them to remember file paths and names. A better solution might allow users to tag data with meaningful labels and searach the archive using standard and user-defined metadata, while maintaining security. last summer, I developed the backend to a tool that adheres to these design goals. The backend works by importing GPFS metadata into a MongoDB cluster, which is then indexed on each attribute. This summer, the author implemented security and developed the user interfae for the search tool. To meet security requirements, each database table is associated with a single user, which only stores records that the user may read, and requires a set of credentials to access. The interface to the search tool is implemented using FUSE (Filesystem in USErspace). FUSE is an intermediate layer that intercepts file system calls and allows the developer to redefine how those calls behave. In the case of this tool, FUSE interfaces with MongoDB to issue queries and populate output. A FUSE implementation is desirable because it allows users to interact with the search tool using commands they are already familiar with. These security and interface additions are essential for a usable product.

  10. The new landscape of parallel computer architecture

    NASA Astrophysics Data System (ADS)

    Shalf, John

    2007-07-01

    The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.

  11. Inflated speedups in parallel simulations via malloc()

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1990-01-01

    Discrete-event simulation programs make heavy use of dynamic memory allocation in order to support simulation's very dynamic space requirements. When programming in C one is likely to use the malloc() routine. However, a parallel simulation which uses the standard Unix System V malloc() implementation may achieve an overly optimistic speedup, possibly superlinear. An alternate implementation provided on some (but not all systems) can avoid the speedup anomaly, but at the price of significantly reduced available free space. This is especially severe on most parallel architectures, which tend not to support virtual memory. It is shown how a simply implemented user-constructed interface to malloc() can both avoid artificially inflated speedups, and make efficient use of the dynamic memory space. The interface simply catches blocks on the basis of their size. The problem is demonstrated empirically, and the effectiveness of the solution is shown both empirically and analytically.

  12. Parallel computing in enterprise modeling.

    SciTech Connect

    Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.; Vanderveen, Keith; Ray, Jaideep; Heath, Zach; Allan, Benjamin A.

    2008-08-01

    This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.

  13. Sequence Alignment Tools: One Parallel Pattern to Rule Them All?

    PubMed Central

    2014-01-01

    In this paper, we advocate high-level programming methodology for next generation sequencers (NGS) alignment tools for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools to their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are liberated from all complex aspects of parallel programming, such as synchronisation protocols, and task scheduling, gaining more possibility for seamless performance tuning. In this work, we show some use cases in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance for all used datasets. PMID:25147803

  14. Parallel Polarization State Generation

    PubMed Central

    She, Alan; Capasso, Federico

    2016-01-01

    The control of polarization, an essential property of light, is of wide scientific and technological interest. The general problem of generating arbitrary time-varying states of polarization (SOP) has always been mathematically formulated by a series of linear transformations, i.e. a product of matrices, imposing a serial architecture. Here we show a parallel architecture described by a sum of matrices. The theory is experimentally demonstrated by modulating spatially-separated polarization components of a laser using a digital micromirror device that are subsequently beam combined. This method greatly expands the parameter space for engineering devices that control polarization. Consequently, performance characteristics, such as speed, stability, and spectral range, are entirely dictated by the technologies of optical intensity modulation, including absorption, reflection, emission, and scattering. This opens up important prospects for polarization state generation (PSG) with unique performance characteristics with applications in spectroscopic ellipsometry, spectropolarimetry, communications, imaging, and security. PMID:27184813

  15. Parallel tridiagonal equation solvers

    NASA Technical Reports Server (NTRS)

    Stone, H. S.

    1974-01-01

    Three parallel algorithms were compared for the direct solution of tridiagonal linear systems of equations. The algorithms are suitable for computers such as ILLIAC 4 and CDC STAR. For array computers similar to ILLIAC 4, cyclic odd-even reduction has the least operation count for highly structured sets of equations, and recursive doubling has the least count for relatively unstructured sets of equations. Since the difference in operation counts for these two algorithms is not substantial, their relative running times may be more related to overhead operations, which are not measured in this paper. The third algorithm, based on Buneman's Poisson solver, has more arithmetic operations than the others, and appears to be the least favorable. For pipeline computers similar to CDC STAR, cyclic odd-even reduction appears to be the most preferable algorithm for all cases.

  16. Parallel Polarization State Generation

    NASA Astrophysics Data System (ADS)

    She, Alan; Capasso, Federico

    2016-05-01

    The control of polarization, an essential property of light, is of wide scientific and technological interest. The general problem of generating arbitrary time-varying states of polarization (SOP) has always been mathematically formulated by a series of linear transformations, i.e. a product of matrices, imposing a serial architecture. Here we show a parallel architecture described by a sum of matrices. The theory is experimentally demonstrated by modulating spatially-separated polarization components of a laser using a digital micromirror device that are subsequently beam combined. This method greatly expands the parameter space for engineering devices that control polarization. Consequently, performance characteristics, such as speed, stability, and spectral range, are entirely dictated by the technologies of optical intensity modulation, including absorption, reflection, emission, and scattering. This opens up important prospects for polarization state generation (PSG) with unique performance characteristics with applications in spectroscopic ellipsometry, spectropolarimetry, communications, imaging, and security.

  17. Augmenting The HST Pure Parallel Observations

    NASA Astrophysics Data System (ADS)

    Patterson, Alan; Soutchkova, G.; Workman, W.

    2012-05-01

    Pure Parallel (PP) programs, designated GO/PAR, are a subgroup of General Observer (GO) programs. PP execute simultaneously with prime GO observations to which they are "attached". The PP observations can be performed with ACS/WFC, WFC3/UVIS or WFC3/IR and can be attached only to GO visits in which the instruments are either COS or STIS. The current HST Parallel Observation Processing System (POPS) was introduced after the Servicing Mission 4. It increased the HST productivity by 10% in terms of the utilization of HST prime orbits and was highly appreciated by the HST observers, allowing them to design efficient, multi-orbit survey projects for collecting large amounts of data on identifiable targets. The results of the WFC3 Infrared Spectroscopic Parallel Survey (WISP), Hubble Infrared Pure Parallel Imaging Extragalactic Survey (HIPPIES), and The Brightest-of-Reionizing Galaxies Pure Parallel Survey (BoRG) exemplify this benefit. In Cycle 19, however, the full advantage of GO/PARs came under risk. Whereas each of the previous cycles provided over one million seconds of exposure time for PP, in Cycle 19 that number reduced to 680,000 seconds. This dramatic decline occurred because of fundamental changes in the construction of COS prime observations. To preserve the science output of PP, the PP Working Group was tasked to find a way to recover the lost time and maximize the total time available for PP observing. The solution was to expand the definition of a PP opportunity to allow PP exposures to span one or more primary exposure readouts. So starting in HST Cycle 20, PP opportunities will no longer be limited to GO visits with a single uninterrupted exposure in an orbit. The resulting enhancements in HST Cycle 20 to the PP opportunity identification and matching process are expected to restore the PP time to previously achieved and possibly even greater levels.

  18. Parallel Imaging Microfluidic Cytometer

    PubMed Central

    Ehrlich, Daniel J.; McKenna, Brian K.; Evans, James G.; Belkina, Anna C.; Denis, Gerald V.; Sherr, David; Cheung, Man Ching

    2011-01-01

    By adding an additional degree of freedom from multichannel flow, the parallel microfluidic cytometer (PMC) combines some of the best features of flow cytometry (FACS) and microscope-based high-content screening (HCS). The PMC (i) lends itself to fast processing of large numbers of samples, (ii) adds a 1-D imaging capability for intracellular localization assays (HCS), (iii) has a high rare-cell sensitivity and, (iv) has an unusual capability for time-synchronized sampling. An inability to practically handle large sample numbers has restricted applications of conventional flow cytometers and microscopes in combinatorial cell assays, network biology, and drug discovery. The PMC promises to relieve a bottleneck in these previously constrained applications. The PMC may also be a powerful tool for finding rare primary cells in the clinic. The multichannel architecture of current PMC prototypes allows 384 unique samples for a cell-based screen to be read out in approximately 6–10 minutes, about 30-times the speed of most current FACS systems. In 1-D intracellular imaging, the PMC can obtain protein localization using HCS marker strategies at many times the sample throughput of CCD-based microscopes or CCD-based single-channel flow cytometers. The PMC also permits the signal integration time to be varied over a larger range than is practical in conventional flow cytometers. The signal-to-noise advantages are useful, for example, in counting rare positive cells in the most difficult early stages of genome-wide screening. We review the status of parallel microfluidic cytometry and discuss some of the directions the new technology may take. PMID:21704835

  19. A high-speed linear algebra library with automatic parallelism

    NASA Technical Reports Server (NTRS)

    Boucher, Michael L.

    1994-01-01

    Parallel or distributed processing is key to getting highest performance workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.

  20. Parallel language constructs for tensor product computations on loosely coupled architectures

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Vanrosendale, John

    1989-01-01

    Distributed memory architectures offer high levels of performance and flexibility, but have proven awkard to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming, and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernal routines is focused on first, such as parallel tridiagonal solvers, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.

  1. High-Performance Parallel Computing. Final report, 1 February 1984-31 January 1985

    SciTech Connect

    Browne, J.C.; Lipovski, G.J.

    1986-01-22

    The 1984/85 accomplishments of the research project High-Performance Parallel Computing included bringing the prototype of the Texas Reconfigurable Array Computer (TRAC) to a configuration and to a state of stability where it could support execution of simple assembly language programs; initial development of a unified model of parallel computation which is a basis for a programming environment uniting process and data flow models of parallel computation; bringing to operational status on an alternative host one of the two parallel programming languages (the Computation Structures Language, CSL) originally intended for use on TRAC; exploration of the expressive capabilities of this programming language; initiation of development of a graphical programming language based on the unified model of parallel computation mentioned previously; major progress on a graphically interfaced Petri Net-based performance modeling system for parallel computations and development of algorithms for scheduling of circuits to realize configurations in configurable Banyan network-based computer architectures.

  2. Parallel contact detection algorithm for transient solid dynamics simulations using PRONTO3D

    SciTech Connect

    Attaway, S.W.; Hendrickson, B.A.; Plimpton, S.J.

    1996-09-01

    An efficient, scalable, parallel algorithm for treating material surface contacts in solid mechanics finite element programs has been implemented in a modular way for MIMD parallel computers. The serial contact detection algorithm that was developed previously for the transient dynamics finite element code PRONTO3D has been extended for use in parallel computation by devising a dynamic (adaptive) processor load balancing scheme.

  3. Parallelized CCHE2D flow model with CUDA Fortran on Graphics Process Units

    Technology Transfer Automated Retrieval System (TEKTRAN)

    This paper presents the CCHE2D implicit flow model parallelized using CUDA Fortran programming technique on Graphics Processing Units (GPUs). A parallelized implicit Alternating Direction Implicit (ADI) solver using Parallel Cyclic Reduction (PCR) algorithm on GPU is developed and tested. This solve...

  4. Parallelized reliability estimation of reconfigurable computer networks

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Das, Subhendu; Palumbo, Dan

    1990-01-01

    A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.

  5. Parallelization and automatic data distribution for nuclear reactor simulations

    SciTech Connect

    Liebrock, L.M.

    1997-07-01

    Detailed attempts at realistic nuclear reactor simulations currently take many times real time to execute on high performance workstations. Even the fastest sequential machine can not run these simulations fast enough to ensure that the best corrective measure is used during a nuclear accident to prevent a minor malfunction from becoming a major catastrophe. Since sequential computers have nearly reached the speed of light barrier, these simulations will have to be run in parallel to make significant improvements in speed. In physical reactor plants, parallelism abounds. Fluids flow, controls change, and reactions occur in parallel with only adjacent components directly affecting each other. These do not occur in the sequentialized manner, with global instantaneous effects, that is often used in simulators. Development of parallel algorithms that more closely approximate the real-world operation of a reactor may, in addition to speeding up the simulations, actually improve the accuracy and reliability of the predictions generated. Three types of parallel architecture (shared memory machines, distributed memory multicomputers, and distributed networks) are briefly reviewed as targets for parallelization of nuclear reactor simulation. Various parallelization models (loop-based model, shared memory model, functional model, data parallel model, and a combined functional and data parallel model) are discussed along with their advantages and disadvantages for nuclear reactor simulation. A variety of tools are introduced for each of the models. Emphasis is placed on the data parallel model as the primary focus for two-phase flow simulation. Tools to support data parallel programming for multiple component applications and special parallelization considerations are also discussed.

  6. Parallel processing and expert systems

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Lau, Sonie

    1991-01-01

    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 90's cannot enjoy an increased level of autonomy without the efficient use of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real time demands are met for large expert systems. Speed-up via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial labs in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems was surveyed. The survey is divided into three major sections: (1) multiprocessors for parallel expert systems; (2) parallel languages for symbolic computations; and (3) measurements of parallelism of expert system. Results to date indicate that the parallelism achieved for these systems is small. In order to obtain greater speed-ups, data parallelism and application parallelism must be exploited.

  7. High Performance Parallel Architectures

    NASA Technical Reports Server (NTRS)

    El-Ghazawi, Tarek; Kaewpijit, Sinthop

    1998-01-01

    Traditional remote sensing instruments are multispectral, where observations are collected at a few different spectral bands. Recently, many hyperspectral instruments, that can collect observations at hundreds of bands, have been operational. Furthermore, there have been ongoing research efforts on ultraspectral instruments that can produce observations at thousands of spectral bands. While these remote sensing technology developments hold great promise for new findings in the area of Earth and space science, they present many challenges. These include the need for faster processing of such increased data volumes, and methods for data reduction. Dimension Reduction is a spectral transformation, aimed at concentrating the vital information and discarding redundant data. One such transformation, which is widely used in remote sensing, is the Principal Components Analysis (PCA). This report summarizes our progress on the development of a parallel PCA and its implementation on two Beowulf cluster configuration; one with fast Ethernet switch and the other with a Myrinet interconnection. Details of the implementation and performance results, for typical sets of multispectral and hyperspectral NASA remote sensing data, are presented and analyzed based on the algorithm requirements and the underlying machine configuration. It will be shown that the PCA application is quite challenging and hard to scale on Ethernet-based clusters. However, the measurements also show that a high- performance interconnection network, such as Myrinet, better matches the high communication demand of PCA and can lead to a more efficient PCA execution.

  8. Trajectories in parallel optics.

    PubMed

    Klapp, Iftach; Sochen, Nir; Mendlovic, David

    2011-10-01

    In our previous work we showed the ability to improve the optical system's matrix condition by optical design, thereby improving its robustness to noise. It was shown that by using singular value decomposition, a target point-spread function (PSF) matrix can be defined for an auxiliary optical system, which works parallel to the original system to achieve such an improvement. In this paper, after briefly introducing the all optics implementation of the auxiliary system, we show a method to decompose the target PSF matrix. This is done through a series of shifted responses of auxiliary optics (named trajectories), where a complicated hardware filter is replaced by postprocessing. This process manipulates the pixel confined PSF response of simple auxiliary optics, which in turn creates an auxiliary system with the required PSF matrix. This method is simulated on two space variant systems and reduces their system condition number from 18,598 to 197 and from 87,640 to 5.75, respectively. We perform a study of the latter result and show significant improvement in image restoration performance, in comparison to a system without auxiliary optics and to other previously suggested hybrid solutions. Image restoration results show that in a range of low signal-to-noise ratio values, the trajectories method gives a significant advantage over alternative approaches. A third space invariant study case is explored only briefly, and we present a significant improvement in the matrix condition number from 1.9160e+013 to 34,526.

  9. Parallel Ada benchmarks for the SVMS

    NASA Technical Reports Server (NTRS)

    Collard, Philippe E.

    1990-01-01

    The use of parallel processing paradigm to design and develop faster and more reliable computers appear to clearly mark the future of information processing. NASA started the development of such an architecture: the Spaceborne VHSIC Multi-processor System (SVMS). Ada will be one of the languages used to program the SVMS. One of the unique characteristics of Ada is that it supports parallel processing at the language level through the tasking constructs. It is important for the SVMS project team to assess how efficiently the SVMS architecture will be implemented, as well as how efficiently Ada environment will be ported to the SVMS. AUTOCLASS II, a Bayesian classifier written in Common Lisp, was selected as one of the benchmarks for SVMS configurations. The purpose of the R and D effort was to provide the SVMS project team with the version of AUTOCLASS II, written in Ada, that would make use of Ada tasking constructs as much as possible so as to constitute a suitable benchmark. Additionally, a set of programs was developed that would measure Ada tasking efficiency on parallel architectures as well as determine the critical parameters influencing tasking efficiency. All this was designed to provide the SVMS project team with a set of suitable tools in the development of the SVMS architecture.

  10. ICASE Computer Science Program

    NASA Technical Reports Server (NTRS)

    1985-01-01

    The Institute for Computer Applications in Science and Engineering computer science program is discussed in outline form. Information is given on such topics as problem decomposition, algorithm development, programming languages, and parallel architectures.

  11. Scalable Unix commands for parallel processors : a high-performance implementation.

    SciTech Connect

    Ong, E.; Lusk, E.; Gropp, W.

    2001-06-22

    We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.

  12. Coordinated Parallel Runway Approaches

    NASA Technical Reports Server (NTRS)

    Koczo, Steve

    1996-01-01

    The current air traffic environment in airport terminal areas experiences substantial delays when weather conditions deteriorate to Instrument Meteorological Conditions (IMC). Expected future increases in air traffic will put additional pressures on the National Airspace System (NAS) and will further compound the high costs associated with airport delays. To address this problem, NASA has embarked on a program to address Terminal Area Productivity (TAP). The goals of the TAP program are to provide increased efficiencies in air traffic during the approach, landing, and surface operations in low-visibility conditions. The ultimate goal is to achieve efficiencies of terminal area flight operations commensurate with Visual Meteorological Conditions (VMC) at current or improved levels of safety.

  13. Solving tridiagonal linear systems on the Butterfly parallel computer

    SciTech Connect

    Kumar, S.P.

    1989-01-01

    A parallel block partitioning method to solve a tri-diagonal system of linear equations is adapted to the BBN Butterfly multiprocessor. A performance analysis of the programming experiments on the 32-node Butterfly is presented. An upper bound on the number of processors to achieve the best performance with this method is derived. The computational results verify the theoretical speedup and efficiency results of the parallel algorithm over its serial counterpart. Also included is a study comparing performance runs of the same code on the Butterfly processor with a hardware floating point unit and on one with a software floating point facility. The total parallel time of the given code is considerably reduced by making use of the hardware floating point facility whereas the speedup and efficiency of the parallel program considerably improve on the system with software floating point capability. The achieved results are shown to be within 82% to 90% of the predicted performance.

  14. An Expert Assistant for Computer Aided Parallelization

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    The prototype implementation of an expert system was developed to assist the user in the computer aided parallelization process. The system interfaces to tools for automatic parallelization and performance analysis. By fusing static program structure information and dynamic performance analysis data the expert system can help the user to filter, correlate, and interpret the data gathered by the existing tools. Sections of the code that show poor performance and require further attention are rapidly identified and suggestions for improvements are presented to the user. In this paper we describe the components of the expert system and discuss its interface to the existing tools. We present a case study to demonstrate the successful use in full scale scientific applications.

  15. Parallel computation of Feynman loop integrals

    NASA Astrophysics Data System (ADS)

    de Doncker, E.; Yuasa, F.

    2012-12-01

    The need for large numbers of compute-intensive integrals, arising in quantum field theory perturbation calculations, justifies the parallelization of loop integrals. In earlier work, we devised effective multivariate methods by iterated (repeated) adaptive numerical integration and extrapolation, applicable for some problem classes where standard multivariate integration techniques fail through integrand singularities. In repeated integration, the function evaluations for the outer integral are independent integral computations for the next lower level and can be distributed to threads in a multi-core parallelization. The distribution level determines the number of dimensions in the inner integral below it and thus the granularity of the computation. We discuss a multi-threaded implementation over the OpenMP Application Program Interface.

  16. Flexibility and Performance of Parallel File Systems

    NASA Technical Reports Server (NTRS)

    Kotz, David; Nieuwejaar, Nils

    1996-01-01

    As we gain experience with parallel file systems, it becomes increasingly clear that a single solution does not suit all applications. For example, it appears to be impossible to find a single appropriate interface, caching policy, file structure, or disk-management strategy. Furthermore, the proliferation of file-system interfaces and abstractions make applications difficult to port. We propose that the traditional functionality of parallel file systems be separated into two components: a fixed core that is standard on all platforms, encapsulating only primitive abstractions and interfaces, and a set of high-level libraries to provide a variety of abstractions and application-programmer interfaces (API's). We present our current and next-generation file systems as examples of this structure. Their features, such as a three-dimensional file structure, strided read and write interfaces, and I/O-node programs, are specifically designed with the flexibility and performance necessary to support a wide range of applications.

  17. Parallel machine architecture for production rule systems

    DOEpatents

    Allen, Jr., John D.; Butler, Philip L.

    1989-01-01

    A parallel processing system for production rule programs utilizes a host processor for storing production rule right hand sides (RHS) and a plurality of rule processors for storing left hand sides (LHS). The rule processors operate in parallel in the recognize phase of the system recognize -Act Cycle to match their respective LHS's against a stored list of working memory elements (WME) in order to find a self consistent set of WME's. The list of WME is dynamically varied during the Act phase of the system in which the host executes or fires rule RHS's for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets at which time the production rule system is halted.

  18. Parallel Adaptive Mesh Refinement

    SciTech Connect

    Diachin, L; Hornung, R; Plassmann, P; WIssink, A

    2005-03-04

    As large-scale, parallel computers have become more widely available and numerical models and algorithms have advanced, the range of physical phenomena that can be simulated has expanded dramatically. Many important science and engineering problems exhibit solutions with localized behavior where highly-detailed salient features or large gradients appear in certain regions which are separated by much larger regions where the solution is smooth. Examples include chemically-reacting flows with radiative heat transfer, high Reynolds number flows interacting with solid objects, and combustion problems where the flame front is essentially a two-dimensional sheet occupying a small part of a three-dimensional domain. Modeling such problems numerically requires approximating the governing partial differential equations on a discrete domain, or grid. Grid spacing is an important factor in determining the accuracy and cost of a computation. A fine grid may be needed to resolve key local features while a much coarser grid may suffice elsewhere. Employing a fine grid everywhere may be inefficient at best and, at worst, may make an adequately resolved simulation impractical. Moreover, the location and resolution of fine grid required for an accurate solution is a dynamic property of a problem's transient features and may not be known a priori. Adaptive mesh refinement (AMR) is a technique that can be used with both structured and unstructured meshes to adjust local grid spacing dynamically to capture solution features with an appropriate degree of resolution. Thus, computational resources can be focused where and when they are needed most to efficiently achieve an accurate solution without incurring the cost of a globally-fine grid. Figure 1.1 shows two example computations using AMR; on the left is a structured mesh calculation of a impulsively-sheared contact surface and on the right is the fuselage and volume discretization of an RAH-66 Comanche helicopter [35]. Note the

  19. Problem solving on computers with parallel organization of computations

    SciTech Connect

    Glushkov, V.M.; Molchanov, I.N.

    1981-07-01

    In principle, parallelism can be implemented on different levels: the level of construction of physical models, objects or processes representing the phenomenon being studied, the level of the solution method, the level of the algorithm, the level of the program, the level of data exchange in the computer, and the level of data input and output. The authors consider some of the available parallelism techniques. 9 references.

  20. An integrated approach to improving the parallel applications development process

    SciTech Connect

    Rasmussen, Craig E; Watson, Gregory R; Tibbitts, Beth R

    2009-01-01

    The development of parallel applications is becoming increasingly important to a broad range of industries. Traditionally, parallel programming was a niche area that was primarily exploited by scientists trying to model extremely complicated physical phenomenon. It is becoming increasingly clear, however, that continued hardware performance improvements through clock scaling and feature-size reduction are simply not going to be achievable for much longer. The hardware vendor's approach to addressing this issue is to employ parallelism through multi-processor and multi-core technologies. While there is little doubt that this approach produces scaling improvements, there are still many significant hurdles to be overcome before parallelism can be employed as a general replacement to more traditional programming techniques. The Parallel Tools Platform (PTP) Project was created in 2005 in an attempt to provide developers with new tools aimed at addressing some of the parallel development issues. Since then, the introduction of a new generation of peta-scale and multi-core systems has highlighted the need for such a platform. In this paper, we describe some of the challenges facing parallel application developers, present the current state of PTP, and provide a simple case study that demonstrates how PTP can be used to locate a potential deadlock situation in an MPI code.

  1. The Galley Parallel File System

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David

    1996-01-01

    Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/0 requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.

  2. Parallel contingency statistics with Titan.

    SciTech Connect

    Thompson, David C.; Pebay, Philippe Pierre

    2009-09-01

    This report summarizes existing statistical engines in VTK/Titan and presents the recently parallelized contingency statistics engine. It is a sequel to [PT08] and [BPRT09] which studied the parallel descriptive, correlative, multi-correlative, and principal component analysis engines. The ease of use of this new parallel engines is illustrated by the means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; however, the very nature of contingency tables prevent this new engine from exhibiting optimal parallel speed-up as the aforementioned engines do. This report therefore discusses the design trade-offs we made and study performance with up to 200 processors.

  3. Nemesis I: Parallel Enhancements to ExodusII

    2006-03-28

    NEMESIS I is an enhancement to the EXODUS II finite element database model used to store and retrieve data for unstructured parallel finite element analyses. NEMESIS I adds data structures which facilitate the partitioning of a scalar (standard serial) EXODUS II file onto parallel disk systems found on many parallel computers. Since the NEMESIS I application programming interface (APl)can be used to append information to an existing EXODUS II files can be used on filesmore » which contain NEMESIS I information. The NEMESIS I information is written and read via C or C++ callable functions which compromise the NEMESIS I API.« less

  4. Beyond the Renderer: Software Architecture for Parallel Graphics and Visualization

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1996-01-01

    As numerous implementations have demonstrated, software-based parallel rendering is an effective way to obtain the needed computational power for a variety of challenging applications in computer graphics and scientific visualization. To fully realize their potential, however, parallel renderers need to be integrated into a complete environment for generating, manipulating, and delivering visual data. We examine the structure and components of such an environment, including the programming and user interfaces, rendering engines, and image delivery systems. We consider some of the constraints imposed by real-world applications and discuss the problems and issues involved in bringing parallel rendering out of the lab and into production.

  5. Nemesis I: Parallel Enhancements to ExodusII

    SciTech Connect

    Hennigan, Gary L.; John, Matthew S.; Shadid, John N.; Sjaardema, Gregory D.; Brayer, Mike S.

    2006-03-28

    NEMESIS I is an enhancement to the EXODUS II finite element database model used to store and retrieve data for unstructured parallel finite element analyses. NEMESIS I adds data structures which facilitate the partitioning of a scalar (standard serial) EXODUS II file onto parallel disk systems found on many parallel computers. Since the NEMESIS I application programming interface (APl)can be used to append information to an existing EXODUS II files can be used on files which contain NEMESIS I information. The NEMESIS I information is written and read via C or C++ callable functions which compromise the NEMESIS I API.

  6. MILP model for resource disruption in parallel processor system

    NASA Astrophysics Data System (ADS)

    Nordin, Syarifah Zyurina; Caccetta, Louis

    2015-02-01

    In this paper, we consider the existence of disruption on unrelated parallel processor scheduling system. The disruption occurs due to a resource shortage where one of the parallel processors is facing breakdown problem during the task allocation, which give impact to the initial scheduling plan. Our objective is to reschedule the original unrelated parallel processor scheduling after the resource disruption that minimizes the makespan. A mixed integer linear programming model is presented for the recovery scheduling that considers the post-disruption policy. We conduct a computational experiment with different stopping time limit to see the performance of the model by using CPLEX 12.1 solver in AIMMS 3.10 software.

  7. Parallel implementation of electronic structure energy, gradient, and Hessian calculations.

    PubMed

    Lotrich, V; Flocke, N; Ponton, M; Yau, A D; Perera, A; Deumens, E; Bartlett, R J

    2008-05-21

    ACES III is a newly written program in which the computationally demanding components of the computational chemistry code ACES II [J. F. Stanton et al., Int. J. Quantum Chem. 526, 879 (1992); [ACES II program system, University of Florida, 1994] have been redesigned and implemented in parallel. The high-level algorithms include Hartree-Fock (HF) self-consistent field (SCF), second-order many-body perturbation theory [MBPT(2)] energy, gradient, and Hessian, and coupled cluster singles, doubles, and perturbative triples [CCSD(T)] energy and gradient. For SCF, MBPT(2), and CCSD(T), both restricted HF and unrestricted HF reference wave functions are available. For MBPT(2) gradients and Hessians, a restricted open-shell HF reference is also supported. The methods are programed in a special language designed for the parallelization project. The language is called super instruction assembly language (SIAL). The design uses an extreme form of object-oriented programing. All compute intensive operations, such as tensor contractions and diagonalizations, all communication operations, and all input-output operations are handled by a parallel program written in C and FORTRAN 77. This parallel program, called the super instruction processor (SIP), interprets and executes the SIAL program. By separating the algorithmic complexity (in SIAL) from the complexities of execution on computer hardware (in SIP), a software system is created that allows for very effective optimization and tuning on different hardware architectures with quite manageable effort. PMID:18500853

  8. Parallel implementation of electronic structure energy, gradient, and Hessian calculations

    NASA Astrophysics Data System (ADS)

    Lotrich, V.; Flocke, N.; Ponton, M.; Yau, A. D.; Perera, A.; Deumens, E.; Bartlett, R. J.

    2008-05-01

    ACES III is a newly written program in which the computationally demanding components of the computational chemistry code ACES II [J. F. Stanton et al., Int. J. Quantum Chem. 526, 879 (1992); [ACES II program system, University of Florida, 1994] have been redesigned and implemented in parallel. The high-level algorithms include Hartree-Fock (HF) self-consistent field (SCF), second-order many-body perturbation theory [MBPT(2)] energy, gradient, and Hessian, and coupled cluster singles, doubles, and perturbative triples [CCSD(T)] energy and gradient. For SCF, MBPT(2), and CCSD(T), both restricted HF and unrestricted HF reference wave functions are available. For MBPT(2) gradients and Hessians, a restricted open-shell HF reference is also supported. The methods are programed in a special language designed for the parallelization project. The language is called super instruction assembly language (SIAL). The design uses an extreme form of object-oriented programing. All compute intensive operations, such as tensor contractions and diagonalizations, all communication operations, and all input-output operations are handled by a parallel program written in C and FORTRAN 77. This parallel program, called the super instruction processor (SIP), interprets and executes the SIAL program. By separating the algorithmic complexity (in SIAL) from the complexities of execution on computer hardware (in SIP), a software system is created that allows for very effective optimization and tuning on different hardware architectures with quite manageable effort.

  9. HPC Infrastructure for Solid Earth Simulation on Parallel Computers

    NASA Astrophysics Data System (ADS)

    Nakajima, K.; Chen, L.; Okuda, H.

    2004-12-01

    Recently, various types of parallel computers with various types of architectures and processing elements (PE) have emerged, which include PC clusters and the Earth Simulator. Moreover, users can easily access to these computer resources through network on Grid environment. It is well-known that thorough tuning is required for programmers to achieve excellent performance on each computer. The method for tuning strongly depends on the type of PE and architecture. Optimization by tuning is a very tough work, especially for developers of applications. Moreover, parallel programming using message passing library such as MPI is another big task for application programmers. In GeoFEM project (http://gefeom.tokyo.rist.or.jp), authors have developed a parallel FEM platform for solid earth simulation on the Earth Simulator, which supports parallel I/O, parallel linear solvers and parallel visualization. This platform can efficiently hide complicated procedures for parallel programming and optimization on vector processors from application programmers. This type of infrastructure is very useful. Source codes developed on PC with single processor is easily optimized on massively parallel computer by linking the source code to the parallel platform installed on the target computer. This parallel platform, called HPC Infrastructure will provide dramatic efficiency, portability and reliability in development of scientific simulation codes. For example, line number of the source codes is expected to be less than 10,000 and porting legacy codes to parallel computer takes 2 or 3 weeks. Original GeoFEM platform supports only I/O, linear solvers and visualization. In the present work, further development for adaptive mesh refinement (AMR) and dynamic load-balancing (DLB) have been carried out. In this presentation, examples of large-scale solid earth simulation using the Earth Simulator will be demonstrated. Moreover, recent results of a parallel computational steering tool using an

  10. Parallel solver for trajectory optimization search directions

    NASA Technical Reports Server (NTRS)

    Psiaki, M. L.; Park, K. H.

    1992-01-01

    A key algorithmic element of a real-time trajectory optimization hardware/software implementation is presented, the search step solver. This is one piece of an algorithm whose overall goal is to make nonlinear trajectory optimization fast enough to provide real-time commands during guidance of a vehicle such as an aeromaneuvering orbiter or the National Aerospace Plane. Many methods of nonlinear programming require the solution of a quadratic program (QP) at each iteration to determine the search step. In the trajectory optimization case, the QP has a special dynamic programming structure. The algorithm exploits this special structure with a divide- and conquer type of parallel implementation. The algorithm solves a (p.N)-stage problem on N processors in O(p + log2 N) operations. The algorithm yields a factor of 8 speed-up over the fastest known serial algorithm when solving a 1024-stage test problem on 32 processors.

  11. Aligning parallel arrays to reduce communication

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.; Schreiber, Robert; Gilbert, John R.; Chatterjee, Siddhartha

    1994-01-01

    Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.

  12. Parallelized FVM algorithm for three-dimensional viscoelastic flows

    NASA Astrophysics Data System (ADS)

    Dou, H.-S.; Phan-Thien, N.

    A parallel implementation for the finite volume method (FVM) for three-dimensional (3D) viscoelastic flows is developed on a distributed computing environment through Parallel Virtual Machine (PVM). The numerical procedure is based on the SIMPLEST algorithm using a staggered FVM discretization in Cartesian coordinates. The final discretized algebraic equations are solved with the TDMA method. The parallelisation of the program is implemented by a domain decomposition strategy, with a master/slave style programming paradigm, and a message passing through PVM. A load balancing strategy is proposed to reduce the communications between processors. The three-dimensional viscoelastic flow in a rectangular duct is computed with this program. The modified Phan-Thien-Tanner (MPTT) constitutive model is employed for the equation system closure. Computing results are validated on the secondary flow problem due to non-zero second normal stress difference N2. Three sets of meshes are used, and the effect of domain decomposition strategies on the performance is discussed. It is found that parallel efficiency is strongly dependent on the grid size and the number of processors for a given block number. The convergence rate as well as the total efficiency of domain decomposition depends upon the flow problem and the boundary conditions. The parallel efficiency increases with increasing problem size for given block number. Comparing to two-dimensional flow problems, 3D parallelized algorithm has a lower efficiency owing to largely overlapped block interfaces, but the parallel algorithm is indeed a powerful means for large scale flow simulations.

  13. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-18

    Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (`PAMI`) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

  14. Parallel processing and expert systems

    NASA Technical Reports Server (NTRS)

    Lau, Sonie; Yan, Jerry C.

    1991-01-01

    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited.

  15. Template based parallel checkpointing in a massively parallel computer system

    DOEpatents

    Archer, Charles Jens; Inglett, Todd Alan

    2009-01-13

    A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.

  16. A numerical differentiation library exploiting parallel architectures

    NASA Astrophysics Data System (ADS)

    Voglis, C.; Hadjidoukas, P. E.; Lagaris, I. E.; Papageorgiou, D. G.

    2009-08-01

    We present a software library for numerically estimating first and second order partial derivatives of a function by finite differencing. Various truncation schemes are offered resulting in corresponding formulas that are accurate to order O(h), O(h), and O(h), h being the differencing step. The derivatives are calculated via forward, backward and central differences. Care has been taken that only feasible points are used in the case where bound constraints are imposed on the variables. The Hessian may be approximated either from function or from gradient values. There are three versions of the software: a sequential version, an OpenMP version for shared memory architectures and an MPI version for distributed systems (clusters). The parallel versions exploit the multiprocessing capability offered by computer clusters, as well as modern multi-core systems and due to the independent character of the derivative computation, the speedup scales almost linearly with the number of available processors/cores. Program summaryProgram title: NDL (Numerical Differentiation Library) Catalogue identifier: AEDG_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEDG_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 73 030 No. of bytes in distributed program, including test data, etc.: 630 876 Distribution format: tar.gz Programming language: ANSI FORTRAN-77, ANSI C, MPI, OPENMP Computer: Distributed systems (clusters), shared memory systems Operating system: Linux, Solaris Has the code been vectorised or parallelized?: Yes RAM: The library uses O(N) internal storage, N being the dimension of the problem Classification: 4.9, 4.14, 6.5 Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such

  17. QCMPI: A parallel environment for quantum computing

    NASA Astrophysics Data System (ADS)

    Tabakin, Frank; Juliá-Díaz, Bruno

    2009-06-01

    QCMPI is a quantum computer (QC) simulation package written in Fortran 90 with parallel processing capabilities. It is an accessible research tool that permits rapid evaluation of quantum algorithms for a large number of qubits and for various "noise" scenarios. The prime motivation for developing QCMPI is to facilitate numerical examination of not only how QC algorithms work, but also to include noise, decoherence, and attenuation effects and to evaluate the efficacy of error correction schemes. The present work builds on an earlier Mathematica code QDENSITY, which is mainly a pedagogic tool. In that earlier work, although the density matrix formulation was featured, the description using state vectors was also provided. In QCMPI, the stress is on state vectors, in order to employ a large number of qubits. The parallel processing feature is implemented by using the Message-Passing Interface (MPI) protocol. A description of how to spread the wave function components over many processors is provided, along with how to efficiently describe the action of general one- and two-qubit operators on these state vectors. These operators include the standard Pauli, Hadamard, CNOT and CPHASE gates and also Quantum Fourier transformation. These operators make up the actions needed in QC. Codes for Grover's search and Shor's factoring algorithms are provided as examples. A major feature of this work is that concurrent versions of the algorithms can be evaluated with each version subject to alternate noise effects, which corresponds to the idea of solving a stochastic Schrödinger equation. The density matrix for the ensemble of such noise cases is constructed using parallel distribution methods to evaluate its eigenvalues and associated entropy. Potential applications of this powerful tool include studies of the stability and correction of QC processes using Hamiltonian based dynamics. Program summaryProgram title: QCMPI Catalogue identifier: AECS_v1_0 Program summary URL

  18. A parallel trajectory optimization tool for aerospace plane guidance

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L.; Park, Kihong

    1991-01-01

    A parallel trajectory optimization algorithm is being developed. One possible mission is to provide real-time, on-line guidance for the National Aerospace Plane. The algorithm solves a discrete-time problem via the augmented Lagrangian nonlinear programming algorithm. The algorithm exploits the dynamic programming structure of the problem to achieve parallelism in calculating cost functions, gradients, constraints, Jacobians, Hessian approximations, search directions, and merit functions. Special additions to the augmented Lagrangian algorithm achieve robust convergence, achieve (almost) superlinear local convergence, and deal with constraint curvature efficiency. The algorithm can handle control and state inequality constraints such as angle-of-attack and dynamic pressure constraints. Portions of the algorithm have been tested. The nonlinear programming core algorithm performs well on a variety of static test problems and on an orbit transfer problem. The parallel search direction algorithm can reduce wall clock time by a factor of 10 for this part of the computation task.

  19. Multi-microprocessor that executes pure LISP in parallel

    SciTech Connect

    Guzman, A.

    1982-01-01

    The architecture presented allows parallel computation of high level languages, with some advantages: (1) the programmer is unaware that he is writing programs for a parallel computer; (2) the processors communicate little with each other, so that interconnection problems are minimised; (3) a given processor is unaware of how many other processors there are, or what they are doing; (4) a processor never waits for another process to have finished, nor does it awake or interrupt another processor. The machine processes in parallel programs written in high level languages capable of being expressed in the lambda notation (applicative languages). It is formed by a collection of general purpose processors which are weakly coupled and without hierarchy. Asynchronous computation is permitted by each processor evaluating a part of a program. 17 references.

  20. Experimental Parallel-Processing Computer

    NASA Technical Reports Server (NTRS)

    Mcgregor, J. W.; Salama, M. A.

    1986-01-01

    Master processor supervises slave processors, each with its own memory. Computer with parallel processing serves as inexpensive tool for experimentation with parallel mathematical algorithms. Speed enhancement obtained depends on both nature of problem and structure of algorithm used. In parallel-processing architecture, "bank select" and control signals determine which one, if any, of N slave processor memories accessible to master processor at any given moment. When so selected, slave memory operates as part of master computer memory. When not selected, slave memory operates independently of main memory. Slave processors communicate with each other via input/output bus.

  1. Parallelized Seeded Region Growing Using CUDA

    PubMed Central

    Park, Seongjin; Lee, Hyunna; Seo, Jinwook; Lee, Kyoung Ho; Shin, Yeong-Gil; Kim, Bohyoung

    2014-01-01

    This paper presents a novel method for parallelizing the seeded region growing (SRG) algorithm using Compute Unified Device Architecture (CUDA) technology, with intention to overcome the theoretical weakness of SRG algorithm of its computation time being directly proportional to the size of a segmented region. The segmentation performance of the proposed CUDA-based SRG is compared with SRG implementations on single-core CPUs, quad-core CPUs, and shader language programming, using synthetic datasets and 20 body CT scans. Based on the experimental results, the CUDA-based SRG outperforms the other three implementations, advocating that it can substantially assist the segmentation during massive CT screening tests. PMID:25309619

  2. Parallelized seeded region growing using CUDA.

    PubMed

    Park, Seongjin; Lee, Jeongjin; Lee, Hyunna; Shin, Juneseuk; Seo, Jinwook; Lee, Kyoung Ho; Shin, Yeong-Gil; Kim, Bohyoung

    2014-01-01

    This paper presents a novel method for parallelizing the seeded region growing (SRG) algorithm using Compute Unified Device Architecture (CUDA) technology, with intention to overcome the theoretical weakness of SRG algorithm of its computation time being directly proportional to the size of a segmented region. The segmentation performance of the proposed CUDA-based SRG is compared with SRG implementations on single-core CPUs, quad-core CPUs, and shader language programming, using synthetic datasets and 20 body CT scans. Based on the experimental results, the CUDA-based SRG outperforms the other three implementations, advocating that it can substantially assist the segmentation during massive CT screening tests.

  3. Parallel Robot for Lower Limb Rehabilitation Exercises

    PubMed Central

    Saadat, Mozafar; Borboni, Alberto

    2016-01-01

    The aim of this study is to investigate the capability of a 6-DoF parallel robot to perform various rehabilitation exercises. The foot trajectories of twenty healthy participants have been measured by a Vicon system during the performing of four different exercises. Based on the kinematics and dynamics of a parallel robot, a MATLAB program was developed in order to calculate the length of the actuators, the actuators' forces, workspace, and singularity locus of the robot during the performing of the exercises. The calculated length of the actuators and the actuators' forces were used by motion analysis in SolidWorks in order to simulate different foot trajectories by the CAD model of the robot. A physical parallel robot prototype was built in order to simulate and execute the foot trajectories of the participants. Kinect camera was used to track the motion of the leg's model placed on the robot. The results demonstrate the robot's capability to perform a full range of various rehabilitation exercises. PMID:27799727

  4. Parallel architectures and neural networks

    SciTech Connect

    Calianiello, E.R. )

    1989-01-01

    This book covers parallel computer architectures and neural networks. Topics include: neural modeling, use of ADA to simulate neural networks, VLSI technology, implementation of Boltzmann machines, and analysis of neural nets.

  5. "Feeling" Series and Parallel Resistances.

    ERIC Educational Resources Information Center

    Morse, Robert A.

    1993-01-01

    Equipped with drinking straws and stirring straws, a teacher can help students understand how resistances in electric circuits combine in series and in parallel. Follow-up suggestions are provided. (ZWH)

  6. Demonstrating Forces between Parallel Wires.

    ERIC Educational Resources Information Center

    Baker, Blane

    2000-01-01

    Describes a physics demonstration that dramatically illustrates the mutual repulsion (attraction) between parallel conductors using insulated copper wire, wooden dowels, a high direct current power supply, electrical tape, and an overhead projector. (WRM)

  7. Metal structures with parallel pores

    NASA Technical Reports Server (NTRS)

    Sherfey, J. M.

    1976-01-01

    Four methods of fabricating metal plates having uniformly sized parallel pores are studied: elongate bundle, wind and sinter, extrude and sinter, and corrugate stack. Such plates are suitable for electrodes for electrochemical and fuel cells.

  8. Parallel computation using limited resources

    SciTech Connect

    Sugla, B.

    1985-01-01

    This thesis addresses itself to the task of designing and analyzing parallel algorithms when the resources of processors, communication, and time are limited. The two parts of this thesis deal with multiprocessor systems and VLSI - the two important parallel processing environments that are prevalent today. In the first part a time-processor-communication tradeoff analysis is conducted for two kinds of problems - N input, 1 output, and N input, N output computations. In the class of problems of the second kind, the problem of prefix computation, an important problem due to the number of naturally occurring computations it can model, is studied. Finally, a general methodology is given for design of parallel algorithms that can be used to optimize a given design to a wide set of architectural variations. The second part of the thesis considers the design of parallel algorithms for the VLSI model of computation when the resource of time is severely restricted.

  9. Parallel algorithms for message decomposition

    SciTech Connect

    Teng, S.H.; Wang, B.

    1987-06-01

    The authors consider the deterministic and random parallel complexity (time and processor) of message decoding: an essential problem in communications systems and translation systems. They present an optimal parallel algorithm to decompose prefix-coded messages and uniquely decipherable-coded messages in O(n/P) time, using O(P) processors (for all P:1 less than or equal toPless than or equal ton/log n) deterministically as well as randomly on the weakest version of parallel random access machines in which concurrent read and concurrent write to a cell in the common memory are not allowed. This is done by reducing decoding to parallel finite-state automata simulation and the prefix sums.

  10. Turbomachinery CFD on parallel computers

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Milner, Edward J.; Quealy, Angela; Townsend, Scott E.

    1992-01-01

    The role of multistage turbomachinery simulation in the development of propulsion system models is discussed. Particularly, the need for simulations with higher fidelity and faster turnaround time is highlighted. It is shown how such fast simulations can be used in engineering-oriented environments. The use of parallel processing to achieve the required turnaround times is discussed. Current work by several researchers in this area is summarized. Parallel turbomachinery CFD research at the NASA Lewis Research Center is then highlighted. These efforts are focused on implementing the average-passage turbomachinery model on MIMD, distributed memory parallel computers. Performance results are given for inviscid, single blade row and viscous, multistage applications on several parallel computers, including networked workstations.

  11. Automatic parallelization of while-Loops using speculative execution

    SciTech Connect

    Collard, J.F.

    1995-04-01

    Automatic parallelization of imperative sequential programs has focused on nests of for-loops. The most recent of them consist in finding an affine mapping with respect to the loop indices to simultaneously capture the temporal and spatial properties of the parallelized program. Such a mapping is usually called a {open_quotes}space-time transformation.{close_quotes} This work describes an extension of these techniques to while-loops using speculative execution. We show that space-time transformations are a good framework for summing up previous restructuration techniques of while-loop, such as pipelining. Moreover, we show that these transformations can be derived and applied automatically.

  12. Graphics applications utilizing parallel processing

    NASA Technical Reports Server (NTRS)

    Rice, John R.

    1990-01-01

    The results are presented of research conducted to develop a parallel graphic application algorithm to depict the numerical solution of the 1-D wave equation, the vibrating string. The research was conducted on a Flexible Flex/32 multiprocessor and a Sequent Balance 21000 multiprocessor. The wave equation is implemented using the finite difference method. The synchronization issues that arose from the parallel implementation and the strategies used to alleviate the effects of the synchronization overhead are discussed.

  13. HEATR project: ATR algorithm parallelization

    NASA Astrophysics Data System (ADS)

    Deardorf, Catherine E.

    1998-09-01

    High Performance Computing (HPC) Embedded Application for Target Recognition (HEATR) is a project funded by the High Performance Computing Modernization Office through the Common HPC Software Support Initiative (CHSSI). The goal of CHSSI is to produce portable, parallel, multi-purpose, freely distributable, support software to exploit emerging parallel computing technologies and enable application of scalable HPC's for various critical DoD applications. Specifically, the CHSSI goal for HEATR is to provide portable, parallel versions of several existing ATR detection and classification algorithms to the ATR-user community to achieve near real-time capability. The HEATR project will create parallel versions of existing automatic target recognition (ATR) detection and classification algorithms and generate reusable code that will support porting and software development process for ATR HPC software. The HEATR Team has selected detection/classification algorithms from both the model- based and training-based (template-based) arena in order to consider the parallelization requirements for detection/classification algorithms across ATR technology. This would allow the Team to assess the impact that parallelization would have on detection/classification performance across ATR technology. A field demo is included in this project. Finally, any parallel tools produced to support the project will be refined and returned to the ATR user community along with the parallel ATR algorithms. This paper will review: (1) HPCMP structure as it relates to HEATR, (2) Overall structure of the HEATR project, (3) Preliminary results for the first algorithm Alpha Test, (4) CHSSI requirements for HEATR, and (5) Project management issues and lessons learned.

  14. Efficiency of parallel direct optimization

    NASA Technical Reports Server (NTRS)

    Janies, D. A.; Wheeler, W. C.

    2001-01-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. c2001 The Willi Hennig Society.

  15. FEREBUS: Highly parallelized engine for kriging training.

    PubMed

    Di Pasquale, Nicodemo; Bane, Michael; Davie, Stuart J; Popelier, Paul L A

    2016-11-01

    FFLUX is a novel force field based on quantum topological atoms, combining multipolar electrostatics with IQA intraatomic and interatomic energy terms. The program FEREBUS calculates the hyperparameters of models produced by the machine learning method kriging. Calculation of kriging hyperparameters (θ and p) requires the optimization of the concentrated log-likelihood L̂(θ,p). FEREBUS uses Particle Swarm Optimization (PSO) and Differential Evolution (DE) algorithms to find the maximum of L̂(θ,p). PSO and DE are two heuristic algorithms that each use a set of particles or vectors to explore the space in which L̂(θ,p) is defined, searching for the maximum. The log-likelihood is a computationally expensive function, which needs to be calculated several times during each optimization iteration. The cost scales quickly with the problem dimension and speed becomes critical in model generation. We present the strategy used to parallelize FEREBUS, and the optimization of L̂(θ,p) through PSO and DE. The code is parallelized in two ways. MPI parallelization distributes the particles or vectors among the different processes, whereas the OpenMP implementation takes care of the calculation of L̂(θ,p), which involves the calculation and inversion of a particular matrix, whose size increases quickly with the dimension of the problem. The run time shows a speed-up of 61 times going from single core to 90 cores with a saving, in one case, of ∼98% of the single core time. In fact, the parallelization scheme presented reduces computational time from 2871 s for a single core calculation, to 41 s for 90 cores calculation. © 2016 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc. PMID:27649926

  16. VIRUS Parallel Observations with The Hobby-Eberly Telescope

    NASA Astrophysics Data System (ADS)

    Odewahn, Stephen C.; Drory, N.; Gebhardt, K.; de Jong, R.; Allende Prieto, C.; Shetrone, M.; Tuttle, S.; HETDEX Collaboration

    2012-01-01

    The VIRUS spectrograph will be installed on the upgraded Hobby-Eberly Telescope (HET) in the Spring of 2012. This instrument will feature an array of integral field units and will be used primarily to conduct a survey for the HET Dark Energy Experiment (HETDEX). The VIRUS instrument will be configured to allow parallel observations during the times when the High-, Medium- and Low-Resolution Spectrographs are operating as the primary instruments on HET. This parallel mode of observing will be enabled long after HETDEX is completed and VIRUS becomes a service instrument on HET. In an effort to explore various scientific uses for such parallel data, we have taken the record of all HET observations for the years 2003 through 2009 and estimated the sky coverage that VIRUS parallel data would have provided. We have used the IFU footprint of VIRUS as it is currently configured, and all observations with the HET spectrographs that meet criteria such as length of exposure time, sky brightness, galactic latitude; and positionally cross-matched these data with various catalogs, such as USNOB2.0, to assess the number of stars and galaxies that would have been detected in a VIRUS parallel program. We review these results here and present plans for software tools that will allow HET users to plan parallel programs.

  17. CRBLASTER: A Parallel-Processing Computational Framework for Embarrassingly Parallel Image-Analysis Algorithms

    NASA Astrophysics Data System (ADS)

    Mighell, Kenneth John

    2010-10-01

    The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called crblaster, which does cosmic-ray rejection of CCD images using the embarrassingly parallel l.a.cosmic algorithm. crblaster is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. crblaster uses a two-dimensional image partitioning algorithm that partitions an input image into N rectangular subimages of nearly equal area; the subimages include sufficient additional pixels along common image partition edges such that the need for communication between computer processes is eliminated. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The crblaster source code is freely available at the official application Web site at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800 × 800 pixel Hubble Space Telescope WFPC2 image takes 44 s with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8 GHz quad-core Intel Xeon processors. crblaster is 7.4 times faster when processing the same image on a single core on the same machine. Processing the same image with crblaster simultaneously on all eight cores of the same machine takes 0.875 s—which is a speedup factor of 50.3 times faster than the

  18. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-08-12

    Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  19. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael E; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective opeartion through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  20. Transactional memories: A new abstraction for parallel processing

    SciTech Connect

    Fasel, J.H.; Lubeck, O.M.; Agrawal, D.; Bruno, J.L.; El Abbadi, A.

    1997-12-01

    This is the final report of a three-year, Laboratory Directed Research and Development (LDRD) project at Los Alamos National Laboratory (LANL). Current distributed memory multiprocessor computer systems make the development of parallel programs difficult. From a programmer`s perspective, it would be most desirable if the underlying hardware and software could provide the programming abstraction commonly referred to as sequential consistency--a single address space and multiple threads; but enforcement of sequential consistency limits opportunities for architectural and operating system performance optimizations, leading to poor performance. Recently, Herlihy and Moss have introduced a new abstraction called transactional memories for parallel programming. The programming model is shared memory with multiple threads. However, data consistency is obtained through the use of transactions rather than mutual exclusion based on locking. The transaction approach permits the underlying system to exploit the potential parallelism in transaction processing. The authors explore the feasibility of designing parallel programs using the transaction paradigm for data consistency and a barrier type of thread synchronization.

  1. Scioto: A Framework for Global-ViewTask Parallelism

    SciTech Connect

    Dinan, James S.; Krishnamoorthy, Sriram; Larkins, D. B.; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy

    2008-09-09

    We introduce Scioto, Shared Collections of Task Objects, a framework for supporting task-parallelism in one-sided and global-view parallel programming models. Scioto provides lightweight, locality aware dynamic load balancing and interoperates with existing parallel models including MPI, SHMEM, CAF, and Global Arrays. Through task parallelism, the Scioto framework provides a solution for overcoming load imbalance and heterogeneity as well as dynamic mapping of computation onto emerging multicore architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that is offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.

  2. Parallel Signal Processing and System Simulation using aCe

    NASA Technical Reports Server (NTRS)

    Dorband, John E.; Aburdene, Maurice F.

    2003-01-01

    Recently, networked and cluster computation have become very popular for both signal processing and system simulation. A new language is ideally suited for parallel signal processing applications and system simulation since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, the new C based parallel language (ace C) for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we will focus on some fundamental features of ace C and present a signal processing application (FFT).

  3. Supercomputing on massively parallel bit-serial architectures

    NASA Technical Reports Server (NTRS)

    Iobst, Ken

    1985-01-01

    Research on the Goodyear Massively Parallel Processor (MPP) suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined train broadcast, and conditional branching at the processing element (PE) control unit level. The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. A description is given of the MPP design. Features of the system and its operation are illustrated in the form of charts and diagrams.

  4. Parallel algorithms for high-speed SAR processing

    NASA Astrophysics Data System (ADS)

    Mallorqui, Jordi J.; Bara, Marc; Broquetas, Antoni; Wis, Mariano; Martinez, Antonio; Nogueira, Leonardo; Moreno, Victoriano

    1998-11-01

    The mass production of SAR products and its usage on monitoring emergency situations (oil spill detection, floods, etc.) requires high-speed SAR processors. Two different parallel strategies for near real time SAR processing based on a multiblock version of the Chirp Scaling Algorithm (CSA) have been studied. The first one is useful for small companies that would like to reduce computation times with no extra investment. It uses a cluster of heterogeneous UNIX workstations as a parallel computer. The second one is oriented to institutions, which have to process large amounts of data in short times and can afford the cost of large parallel computers. The parallel programming has reduced in both cases the computational times when compared with the sequential versions.

  5. Clock Agreement Among Parallel Supercomputer Nodes

    DOE Data Explorer

    Jones, Terry R.; Koenig, Gregory A.

    2014-04-30

    This dataset presents measurements that quantify the clock synchronization time-agreement characteristics among several high performance computers including the current world's most powerful machine for open science, the U.S. Department of Energy's Titan machine sited at Oak Ridge National Laboratory. These ultra-fast machines derive much of their computational capability from extreme node counts (over 18000 nodes in the case of the Titan machine). Time-agreement is commonly utilized by parallel programming applications and tools, distributed programming application and tools, and system software. Our time-agreement measurements detail the degree of time variance between nodes and how that variance changes over time. The dataset includes empirical measurements and the accompanying spreadsheets.

  6. An intelligent allocation algorithm for parallel processing

    NASA Technical Reports Server (NTRS)

    Carroll, Chester C.; Homaifar, Abdollah; Ananthram, Kishan G.

    1988-01-01

    The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of interprocessor communication network, influence the allocation. To achieve realistic estimations of the executive durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication duration, varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network.

  7. Parallel Implicit Algorithms for CFD

    NASA Technical Reports Server (NTRS)

    Keyes, David E.

    1998-01-01

    The main goal of this project was efficient distributed parallel and workstation cluster implementations of Newton-Krylov-Schwarz (NKS) solvers for implicit Computational Fluid Dynamics (CFD.) "Newton" refers to a quadratically convergent nonlinear iteration using gradient information based on the true residual, "Krylov" to an inner linear iteration that accesses the Jacobian matrix only through highly parallelizable sparse matrix-vector products, and "Schwarz" to a domain decomposition form of preconditioning the inner Krylov iterations with primarily neighbor-only exchange of data between the processors. Prior experience has established that Newton-Krylov methods are competitive solvers in the CFD context and that Krylov-Schwarz methods port well to distributed memory computers. The combination of the techniques into Newton-Krylov-Schwarz was implemented on 2D and 3D unstructured Euler codes on the parallel testbeds that used to be at LaRC and on several other parallel computers operated by other agencies or made available by the vendors. Early implementations were made directly in Massively Parallel Integration (MPI) with parallel solvers we adapted from legacy NASA codes and enhanced for full NKS functionality. Later implementations were made in the framework of the PETSC library from Argonne National Laboratory, which now includes pseudo-transient continuation Newton-Krylov-Schwarz solver capability (as a result of demands we made upon PETSC during our early porting experiences). A secondary project pursued with funding from this contract was parallel implicit solvers in acoustics, specifically in the Helmholtz formulation. A 2D acoustic inverse problem has been solved in parallel within the PETSC framework.

  8. Applications of the Aurora parallel Prolog system to computational molecular biology

    SciTech Connect

    Lusk, E.L.; Overbeek, R.; Mudambi, S.; Szeredi, P.

    1993-09-01

    We describe an investigation into the use of the Aurora parallel Prolog system in two applications within the area of computational molecular biology. The computational requirements were large, due to the nature of the applications, and were large, due to the nature of the applications, and were carried out on a scalable parallel computer the BBN ``Butterfly`` TC-2000. Results include both a demonstration that logic programming can be effective in the context of demanding applications on large-scale parallel machines, and some insights into parallel programming in Prolog.

  9. An efficient parallel algorithm for three-dimensional analysis of subsidence above gas reservoirs

    NASA Astrophysics Data System (ADS)

    Schrefler, B. A.; Wang, X.; Salomoni, V. A.; Zuccolo, G.

    1999-09-01

    In this paper an efficient parallel algorithm to solve a three-dimensional problem of subsidence above exploited gas reservoirs is presented. The parallel program is developed on a cluster of workstations. The parallel virtual machine (PVM) system is used to handle communications among networked workstations. The method has advantages such as numbering of the finite element mesh in an arbitrary manner, simple programming organization, smaller core requirements and computation times. An implementation of this parallel method on workstations is discussed, the speed-up and efficiency of this method being demonstrated by a numerical example. Copyright

  10. Parallel Object-Oriented Computation Applied to a Finite Element Problem

    NASA Technical Reports Server (NTRS)

    Weissman, Jon B.; Grimshaw, Andrew S.; Ferraro, Robert

    1993-01-01

    The conventional wisdom in the scientific computing community is that the best way to solve large-scale numerically intensive scientific problems on today's parallel MIMD computers is to use Fortran or C programmed in a data-parallel style using low-level message-passing primitives. This approach inevitably leads to nonportable codes, extensive development time, and restricts parallel programming to the domain of the expert programmer. We believe that these problems are not inherent to parallel computing but are the result of the tools used. We will show that comparable performance can be achieved with little effort if better tools that present higher level abstractions are used.

  11. A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.; Markos, A. T.

    1975-01-01

    A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.

  12. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm

  13. Parallelizing Timed Petri Net simulations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1993-01-01

    The possibility of using parallel processing to accelerate the simulation of Timed Petri Nets (TPN's) was studied. It was recognized that complex system development tools often transform system descriptions into TPN's or TPN-like models, which are then simulated to obtain information about system behavior. Viewed this way, it was important that the parallelization of TPN's be as automatic as possible, to admit the possibility of the parallelization being embedded in the system design tool. Later years of the grant were devoted to examining the problem of joint performance and reliability analysis, to explore whether both types of analysis could be accomplished within a single framework. In this final report, the results of our studies are summarized. We believe that the problem of parallelizing TPN's automatically for MIMD architectures has been almost completely solved for a large and important class of problems. Our initial investigations into joint performance/reliability analysis are two-fold; it was shown that Monte Carlo simulation, with importance sampling, offers promise of joint analysis in the context of a single tool, and methods for the parallel simulation of general Continuous Time Markov Chains, a model framework within which joint performance/reliability models can be cast, were developed. However, very much more work is needed to determine the scope and generality of these approaches. The results obtained in our two studies, future directions for this type of work, and a list of publications are included.

  14. Computing contingency statistics in parallel.

    SciTech Connect

    Bennett, Janine Camille; Thompson, David; Pebay, Philippe Pierre

    2010-09-01

    Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and {chi}{sup 2} independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel.We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.

  15. Automating the parallel processing of fluid and structural dynamics calculations

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.; Cole, Gary L.

    1987-01-01

    The NASA Lewis Research Center is actively involved in the development of expert system technology to assist users in applying parallel processing to computational fluid and structural dynamic analysis. The goal of this effort is to eliminate the necessity for the physical scientist to become a computer scientist in order to effectively use the computer as a research tool. Programming and operating software utilities have previously been developed to solve systems of ordinary nonlinear differential equations on parallel scalar processors. Current efforts are aimed at extending these capabilties to systems of partial differential equations, that describe the complex behavior of fluids and structures within aerospace propulsion systems. This paper presents some important considerations in the redesign, in particular, the need for algorithms and software utilities that can automatically identify data flow patterns in the application program and partition and allocate calculations to the parallel processors. A library-oriented multiprocessing concept for integrating the hardware and software functions is described.

  16. Automating the parallel processing of fluid and structural dynamics calculations

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.; Cole, Gary L.

    1987-01-01

    The NASA Lewis Research Center is actively involved in the development of expert system technology to assist users in applying parallel processing to computational fluid and structural dynamic analysis. The goal of this effort is to eliminate the necessity for the physical scientist to become a computer scientist in order to effectively use the computer as a research tool. Programming and operating software utilities have previously been developed to solve systems of ordinary nonlinear differential equations on parallel scalar processors. Current efforts are aimed at extending these capabilities to systems of partial differential equations, that describe the complex behavior of fluids and structures within aerospace propulsion systems. This paper presents some important considerations in the redesign, in particular, the need for algorithms and software utilities that can automatically identify data flow patterns in the application program and partition and allocate calculations to the parallel processors. A library-oriented multiprocessing concept for integrating the hardware and software functions is described.

  17. PARAVT: Parallel Voronoi tessellation code

    NASA Astrophysics Data System (ADS)

    González, R. E.

    2016-10-01

    In this study, we present a new open source code for massive parallel computation of Voronoi tessellations (VT hereafter) in large data sets. The code is focused for astrophysical purposes where VT densities and neighbors are widely used. There are several serial Voronoi tessellation codes, however no open source and parallel implementations are available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented under MPI and VT using Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes neighbors list, Voronoi density, Voronoi cell volume, density gradient for each particle, and densities on a regular grid. Code implementation and user guide are publicly available at https://github.com/regonzar/paravt.

  18. Visualizing Parallel Computer System Performance

    NASA Technical Reports Server (NTRS)

    Malony, Allen D.; Reed, Daniel A.

    1988-01-01

    Parallel computer systems are among the most complex of man's creations, making satisfactory performance characterization difficult. Despite this complexity, there are strong, indeed, almost irresistible, incentives to quantify parallel system performance using a single metric. The fallacy lies in succumbing to such temptations. A complete performance characterization requires not only an analysis of the system's constituent levels, it also requires both static and dynamic characterizations. Static or average behavior analysis may mask transients that dramatically alter system performance. Although the human visual system is remarkedly adept at interpreting and identifying anomalies in false color data, the importance of dynamic, visual scientific data presentation has only recently been recognized Large, complex parallel system pose equally vexing performance interpretation problems. Data from hardware and software performance monitors must be presented in ways that emphasize important events while eluding irrelevant details. Design approaches and tools for performance visualization are the subject of this paper.

  19. Fast data parallel polygon rendering

    SciTech Connect

    Ortega, F.A.; Hansen, C.D.

    1993-09-01

    This paper describes a parallel method for polygonal rendering on a massively parallel SIMD machine. This method, based on a simple shading model, is targeted for applications which require very fast polygon rendering for extremely large sets of polygons such as is found in many scientific visualization applications. The algorithms described in this paper are incorporated into a library of 3D graphics routines written for the Connection Machine. The routines are implemented on both the CM-200 and the CM-5. This library enables a scientists to display 3D shaded polygons directly from a parallel machine without the need to transmit huge amounts of data to a post-processing rendering system.

  20. Parallel integrated frame synchronizer chip

    NASA Technical Reports Server (NTRS)

    Ghuman, Parminder Singh (Inventor); Solomon, Jeffrey Michael (Inventor); Bennett, Toby Dennis (Inventor)

    2000-01-01

    A parallel integrated frame synchronizer which implements a sequential pipeline process wherein serial data in the form of telemetry data or weather satellite data enters the synchronizer by means of a front-end subsystem and passes to a parallel correlator subsystem or a weather satellite data processing subsystem. When in a CCSDS mode, data from the parallel correlator subsystem passes through a window subsystem, then to a data alignment subsystem and then to a bit transition density (BTD)/cyclical redundancy check (CRC) decoding subsystem. Data from the BTD/CRC decoding subsystem or data from the weather satellite data processing subsystem is then fed to an output subsystem where it is output from a data output port.

  1. Massively Parallel MRI Detector Arrays

    PubMed Central

    Keil, Boris; Wald, Lawrence L

    2013-01-01

    Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called “ultimate” SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758

  2. Parallel algorithms for mapping pipelined and parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1988-01-01

    Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.

  3. Medipix2 parallel readout system

    NASA Astrophysics Data System (ADS)

    Fanti, V.; Marzeddu, R.; Randaccio, P.

    2003-08-01

    A fast parallel readout system based on a PCI board has been developed in the framework of the Medipix collaboration. The readout electronics consists of two boards: the motherboard directly interfacing the Medipix2 chip, and the PCI board with digital I/O ports 32 bits wide. The device driver and readout software have been developed at low level in Assembler to allow fast data transfer and image reconstruction. The parallel readout permits a transfer rate up to 64 Mbytes/s. http://medipix.web.cern ch/MEDIPIX/

  4. A Parallel Genetic Algorithm for Automated Electronic Circuit Design

    NASA Technical Reports Server (NTRS)

    Lohn, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris; Norvig, Peter (Technical Monitor)

    2000-01-01

    We describe a parallel genetic algorithm (GA) that automatically generates circuit designs using evolutionary search. A circuit-construction programming language is introduced and we show how evolution can generate practical analog circuit designs. Our system allows circuit size (number of devices), circuit topology, and device values to be evolved. We present experimental results as applied to analog filter and amplifier design tasks.

  5. An introduction to performance debugging for parallel computers

    SciTech Connect

    Gropp, W.

    1995-02-07

    Programming parallel computers for performance is a difficult task that requires careful attention to both single-node performance and data exchange between processors. This paper discusses some of the sources of poor performance, ways to identify them in an application, and a few ways to address these issues.

  6. Parallel-Processing Software for Correlating Stereo Images

    NASA Technical Reports Server (NTRS)

    Klimeck, Gerhard; Deen, Robert; Mcauley, Michael; DeJong, Eric

    2007-01-01

    A computer program implements parallel- processing algorithms for cor relating images of terrain acquired by stereoscopic pairs of digital stereo cameras on an exploratory robotic vehicle (e.g., a Mars rove r). Such correlations are used to create three-dimensional computatio nal models of the terrain for navigation. In this program, the scene viewed by the cameras is segmented into subimages. Each subimage is assigned to one of a number of central processing units (CPUs) opera ting simultaneously.

  7. A CS1 pedagogical approach to parallel thinking

    NASA Astrophysics Data System (ADS)

    Rague, Brian William

    Almost all collegiate programs in Computer Science offer an introductory course in programming primarily devoted to communicating the foundational principles of software design and development. The ACM designates this introduction to computer programming course for first-year students as CS1, during which methodologies for solving problems within a discrete computational context are presented. Logical thinking is highlighted, guided primarily by a sequential approach to algorithm development and made manifest by typically using the latest, commercially successful programming language. In response to the most recent developments in accessible multicore computers, instructors of these introductory classes may wish to include training on how to design workable parallel code. Novel issues arise when programming concurrent applications which can make teaching these concepts to beginning programmers a seemingly formidable task. Student comprehension of design strategies related to parallel systems should be monitored to ensure an effective classroom experience. This research investigated the feasibility of integrating parallel computing concepts into the first-year CS classroom. To quantitatively assess student comprehension of parallel computing, an experimental educational study using a two-factor mixed group design was conducted to evaluate two instructional interventions in addition to a control group: (1) topic lecture only, and (2) topic lecture with laboratory work using a software visualization Parallel Analysis Tool (PAT) specifically designed for this project. A new evaluation instrument developed for this study, the Perceptions of Parallelism Survey (PoPS), was used to measure student learning regarding parallel systems. The results from this educational study show a statistically significant main effect among the repeated measures, implying that student comprehension levels of parallel concepts as measured by the PoPS improve immediately after the delivery of

  8. Matpar: Parallel Extensions for MATLAB

    NASA Technical Reports Server (NTRS)

    Springer, P. L.

    1998-01-01

    Matpar is a set of client/server software that allows a MATLAB user to take advantage of a parallel computer for very large problems. The user can replace calls to certain built-in MATLAB functions with calls to Matpar functions.

  9. Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; VanderWijngaart, Rob F.

    2003-01-01

    We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in bench-marks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.

  10. Automatic Parallelization Using OpenMP Based on STL Semantics

    SciTech Connect

    Liao, C; Quinlan, D J; Willcock, J J; Panas, T

    2008-06-03

    Automatic parallelization of sequential applications using OpenMP as a target has been attracting significant attention recently because of the popularity of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high level abstractions such as STL containers are largely ignored due to the lack of research compilers that are readily able to recognize high level object-oriented abstractions of STL. In this paper, we use ROSE, a multiple-language source-to-source compiler infrastructure, to build a parallelizer that can recognize such high level semantics and parallelize C++ applications using certain STL containers. The idea of our work is to automatically insert OpenMP constructs using extended conventional dependence analysis and the known domain-specific semantics of high-level abstractions with optional assistance from source code annotations. In addition, the parallelizer is followed by an OpenMP translator to translate the generated OpenMP programs into multi-threaded code targeted to a popular OpenMP runtime library. Our work extends the applicability of automatic parallelization and provides another way to take advantage of multicore processors.

  11. High Performance Input/Output for Parallel Computer Systems

    NASA Technical Reports Server (NTRS)

    Ligon, W. B.

    1996-01-01

    The goal of our project is to study the I/O characteristics of parallel applications used in Earth Science data processing systems such as Regional Data Centers (RDCs) or EOSDIS. Our approach is to study the runtime behavior of typical programs and the effect of key parameters of the I/O subsystem both under simulation and with direct experimentation on parallel systems. Our three year activity has focused on two items: developing a test bed that facilitates experimentation with parallel I/O, and studying representative programs from the Earth science data processing application domain. The Parallel Virtual File System (PVFS) has been developed for use on a number of platforms including the Tiger Parallel Architecture Workbench (TPAW) simulator, The Intel Paragon, a cluster of DEC Alpha workstations, and the Beowulf system (at CESDIS). PVFS provides considerable flexibility in configuring I/O in a UNIX- like environment. Access to key performance parameters facilitates experimentation. We have studied several key applications fiom levels 1,2 and 3 of the typical RDC processing scenario including instrument calibration and navigation, image classification, and numerical modeling codes. We have also considered large-scale scientific database codes used to organize image data.

  12. Use Computer-Aided Tools to Parallelize Large CFD Applications

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Yan, J.

    2000-01-01

    Porting applications to high performance parallel computers is always a challenging task. It is time consuming and costly. With rapid progressing in hardware architectures and increasing complexity of real applications in recent years, the problem becomes even more sever. Today, scalability and high performance are mostly involving handwritten parallel programs using message-passing libraries (e.g. MPI). However, this process is very difficult and often error-prone. The recent reemergence of shared memory parallel (SMP) architectures, such as the cache coherent Non-Uniform Memory Access (ccNUMA) architecture used in the SGI Origin 2000, show good prospects for scaling beyond hundreds of processors. Programming on an SMP is simplified by working in a globally accessible address space. The user can supply compiler directives, such as OpenMP, to parallelize the code. As an industry standard for portable implementation of parallel programs for SMPs, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. It promises an incremental path for parallel conversion of existing software, as well as scalability and performance for a complete rewrite or an entirely new development. Perhaps the main disadvantage of programming with directives is that inserted directives may not necessarily enhance performance. In the worst cases, it can create erroneous results. While vendors have provided tools to perform error-checking and profiling, automation in directive insertion is very limited and often failed on large programs, primarily due to the lack of a thorough enough data dependence analysis. To overcome the deficiency, we have developed a toolkit, CAPO, to automatically insert OpenMP directives in Fortran programs and apply certain degrees of optimization. CAPO is aimed at taking advantage of detailed inter-procedural dependence analysis provided by CAPTools, developed by the University of

  13. Parallelizing AT with MatlabMPI

    SciTech Connect

    Li, Evan Y.; /Brown U. /SLAC

    2011-06-22

    The Accelerator Toolbox (AT) is a high-level collection of tools and scripts specifically oriented toward solving problems dealing with computational accelerator physics. It is integrated into the MATLAB environment, which provides an accessible, intuitive interface for accelerator physicists, allowing researchers to focus the majority of their efforts on simulations and calculations, rather than programming and debugging difficulties. Efforts toward parallelization of AT have been put in place to upgrade its performance to modern standards of computing. We utilized the packages MatlabMPI and pMatlab, which were developed by MIT Lincoln Laboratory, to set up a message-passing environment that could be called within MATLAB, which set up the necessary pre-requisites for multithread processing capabilities. On local quad-core CPUs, we were able to demonstrate processor efficiencies of roughly 95% and speed increases of nearly 380%. By exploiting the efficacy of modern-day parallel computing, we were able to demonstrate incredibly efficient speed increments per processor in AT's beam-tracking functions. Extrapolating from prediction, we can expect to reduce week-long computation runtimes to less than 15 minutes. This is a huge performance improvement and has enormous implications for the future computing power of the accelerator physics group at SSRL. However, one of the downfalls of parringpass is its current lack of transparency; the pMatlab and MatlabMPI packages must first be well-understood by the user before the system can be configured to run the scripts. In addition, the instantiation of argument parameters requires internal modification of the source code. Thus, parringpass, cannot be directly run from the MATLAB command line, which detracts from its flexibility and user-friendliness. Future work in AT's parallelization will focus on development of external functions and scripts that can be called from within MATLAB and configured on multiple nodes, while

  14. QLISP for parallel processors. Final report, 15 July 1986-31 July 1988

    SciTech Connect

    McCarthy, J.

    1989-01-01

    The goal of the QLISP project at Stanford is to gain experience with the shared-memory, queue-based approach to parallel Lisp, by implementing the QLISP language on an actual multiprocessor, and by developing a symbolic algebra system as a testbed application. The experiments performed on the simulator included: 1. Algorithms for sorting and basic data-structure manipulation for polynomials. 2. Partitioning and scheduling methods for parallel programming. 3. Parallelizing the production rule system OPS5.

  15. SIAM Conference on Parallel Processing for Scientific Computing - March 12-14, 2008

    SciTech Connect

    2008-09-08

    The themes of the 2008 conference included, but were not limited to: Programming languages, models, and compilation techniques; The transition to ubiquitous multicore/manycore processors; Scientific computing on special-purpose processors (Cell, GPUs, etc.); Architecture-aware algorithms; From scalable algorithms to scalable software; Tools for software development and performance evaluation; Global perspectives on HPC; Parallel computing in industry; Distributed/grid computing; Fault tolerance; Parallel visualization and large scale data management; and The future of parallel architectures.

  16. A Data Parallel Algorithm for XML DOM Parsing

    NASA Astrophysics Data System (ADS)

    Shah, Bhavik; Rao, Praveen R.; Moon, Bongki; Rajagopalan, Mohan

    The extensible markup language XML has become the de facto standard for information representation and interchange on the Internet. XML parsing is a core operation performed on an XML document for it to be accessed and manipulated. This operation is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution, because future multicore processors will support hundreds of cores, and will offer a high degree of parallelism in hardware. We propose a data parallel algorithm called ParDOM for XML DOM parsing, that builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks and parsed in parallel. In the second phase, partial DOM node tree structures created during the first phase, are linked together (in parallel) to build a complete DOM node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme - each chunk can contain an arbitrary number of start and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports map and sort operations. Through empirical evaluation, we show that ParDOM yields better scalability than PXP [23] - a recently proposed parallel DOM parsing algorithm - on commodity multicore processors. Furthermore, ParDOM can process a wide-variety of XML datasets with complex structures which PXP fails to parse.

  17. Hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    SciTech Connect

    Howison, Mark; Bethel, E. Wes; Childs, Hank

    2010-07-12

    This work studies the performance and scalability characteristics of"hybrid'"parallel programming and execution as applied to raycasting volume rendering -- a staple visualization algorithm -- on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.

  18. MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    SciTech Connect

    Howison, Mark; Bethel, E. Wes; Childs, Hank

    2010-03-20

    This work studies the performance and scalability characteristics of"hybrid'" parallel programming and execution as applied to raycasting volume rendering -- a staple visualization algorithm -- on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.

  19. Hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    SciTech Connect

    Howison, Mark; Bethel, E. Wes; Childs, Hank

    2010-06-14

    This work studies the performance and scalability characteristics of"hybrid" parallel programming and execution as applied to raycasting volume rendering -- a staple visualization algorithm -- on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.

  20. Scalable Performance Environments for Parallel Systems

    NASA Technical Reports Server (NTRS)

    Reed, Daniel A.; Olson, Robert D.; Aydt, Ruth A.; Madhyastha, Tara M.; Birkett, Thomas; Jensen, David W.; Nazief, Bobby A. A.; Totty, Brian K.

    1991-01-01

    As parallel systems expand in size and complexity, the absence of performance tools for these parallel systems exacerbates the already difficult problems of application program and system software performance tuning. Moreover, given the pace of technological change, we can no longer afford to develop ad hoc, one-of-a-kind performance instrumentation software; we need scalable, portable performance analysis tools. We describe an environment prototype based on the lessons learned from two previous generations of performance data analysis software. Our environment prototype contains a set of performance data transformation modules that can be interconnected in user-specified ways. It is the responsibility of the environment infrastructure to hide details of module interconnection and data sharing. The environment is written in C++ with the graphical displays based on X windows and the Motif toolkit. It allows users to interconnect and configure modules graphically to form an acyclic, directed data analysis graph. Performance trace data are represented in a self-documenting stream format that includes internal definitions of data types, sizes, and names. The environment prototype supports the use of head-mounted displays and sonic data presentation in addition to the traditional use of visual techniques.

  1. Empirical study of parallel LRU simulation algorithms

    NASA Technical Reports Server (NTRS)

    Carr, Eric; Nicol, David M.

    1994-01-01

    This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The two other SIMD algorithm are more complex, but have costs that are independent on the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithm implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.

  2. Cosmic Shear - with ACS Pure Parallel Observations

    NASA Astrophysics Data System (ADS)

    Ratnatunga, Kavan

    2002-07-01

    The ACS, with greater sensitivity and sky coverage, will extend our ability to measure the weak gravitational lensing of galaxy images caused by the large scale distribution of dark matter. We propose to use the ACS in pure parallel {non- proprietary} mode, following the guidelines of the ACS Default Pure Parallel Program. Using the HST Medium Deep Survey WFPC2 database we have measured cosmic shear at arc-min angular scales. The MDS image parameters, in particular the galaxy orientations and axis ratios, are such that any residual corrections due to errors in the PSF or jitter are much smaller than the measured signal. This situation is in stark contrast with ground-based observations. We have also developed a statistical analysis procedure to derive unbiased estimates of cosmic shear from a large number of fields, each of which has a very small number of galaxies. We have therefore set the stage for measurements with the ACS at fainter apparent magnitudes and smaller, 10 arc-second scales corresponding to larger cosmological distances. We will adapt existing MDS WFPC2 maximum likelihood galaxy image analysis algorithms to work with the ACS. The analysis would also yield an online database similar to that in archive.stsci.edu/mds/

  3. Parallelizing serial code for a distributed processing environment with an application to high frequency electromagnetic scattering

    NASA Astrophysics Data System (ADS)

    Work, Paul R.

    1991-12-01

    This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.

  4. Parallel aeroelastic computations for wing and wing-body configurations

    NASA Technical Reports Server (NTRS)

    Byun, Chansup

    1994-01-01

    The objective of this research is to develop computationally efficient methods for solving fluid-structural interaction problems by directly coupling finite difference Euler/Navier-Stokes equations for fluids and finite element dynamics equations for structures on parallel computers. This capability will significantly impact many aerospace projects of national importance such as Advanced Subsonic Civil Transport (ASCT), where the structural stability margin becomes very critical at the transonic region. This research effort will have direct impact on the High Performance Computing and Communication (HPCC) Program of NASA in the area of parallel computing.

  5. Methods for operating parallel computing systems employing sequenced communications

    DOEpatents

    Benner, R.E.; Gustafson, J.L.; Montry, G.R.

    1999-08-10

    A parallel computing system and method are disclosed having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system. 15 figs.

  6. Methods for operating parallel computing systems employing sequenced communications

    DOEpatents

    Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

    1999-01-01

    A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.

  7. A design methodology for portable software on parallel computers

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Miller, Keith W.; Chrisman, Dan A.

    1993-01-01

    This final report for research that was supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently. Trying to keep one's software on the current high performance hardware, a software developer almost continually faces yet another expensive software transportation. The problem of the proposed research is to create a design methodology that helps designers to more precisely control both portability and hardware-specific programming details. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The list of research tasks are specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation. We have implemented and measured

  8. Parallel and Distributed Computational Fluid Dynamics: Experimental Results and Challenges

    NASA Technical Reports Server (NTRS)

    Djomehri, Mohammad Jahed; Biswas, R.; VanderWijngaart, R.; Yarrow, M.

    2000-01-01

    This paper describes several results of parallel and distributed computing using a large scale production flow solver program. A coarse grained parallelization based on clustering of discretization grids combined with partitioning of large grids for load balancing is presented. An assessment is given of its performance on distributed and distributed-shared memory platforms using large scale scientific problems. An experiment with this solver, adapted to a Wide Area Network execution environment is presented. We also give a comparative performance assessment of computation and communication times on both the tightly and loosely-coupled machines.

  9. A portable MPI-based parallel vector template library

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.

    1995-01-01

    This paper discusses the design and implementation of a polymorphic collection library for distributed address-space parallel computers. The library provides a data-parallel programming model for C++ by providing three main components: a single generic collection class, generic algorithms over collections, and generic algebraic combining functions. Collection elements are the fourth component of a program written using the library and may be either of the built-in types of C or of user-defined types. Many ideas are borrowed from the Standard Template Library (STL) of C++, although a restricted programming model is proposed because of the distributed address-space memory model assumed. Whereas the STL provides standard collections and implementations of algorithms for uniprocessors, this paper advocates standardizing interfaces that may be customized for different parallel computers. Just as the STL attempts to increase programmer productivity through code reuse, a similar standard for parallel computers could provide programmers with a standard set of algorithms portable across many different architectures. The efficacy of this approach is verified by examining performance data collected from an initial implementation of the library running on an IBM SP-2 and an Intel Paragon.

  10. A Portable MPI-Based Parallel Vector Template Library

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.

    1995-01-01

    This paper discusses the design and implementation of a polymorphic collection library for distributed address-space parallel computers. The library provides a data-parallel programming model for C + + by providing three main components: a single generic collection class, generic algorithms over collections, and generic algebraic combining functions. Collection elements are the fourth component of a program written using the library and may be either of the built-in types of c or of user-defined types. Many ideas are borrowed from the Standard Template Library (STL) of C++, although a restricted programming model is proposed because of the distributed address-space memory model assumed. Whereas the STL provides standard collections and implementations of algorithms for uniprocessors, this paper advocates standardizing interfaces that may be customized for different parallel computers. Just as the STL attempts to increase programmer productivity through code reuse, a similar standard for parallel computers could provide programmers with a standard set of algorithms portable across many different architectures. The efficacy of this approach is verified by examining performance data collected from an initial implementation of the library running on an IBM SP-2 and an Intel Paragon.

  11. Parallel line analysis: multifunctional software for the biomedical sciences

    NASA Technical Reports Server (NTRS)

    Swank, P. R.; Lewis, M. L.; Damron, K. L.; Morrison, D. R.

    1990-01-01

    An easy to use, interactive FORTRAN program for analyzing the results of parallel line assays is described. The program is menu driven and consists of five major components: data entry, data editing, manual analysis, manual plotting, and automatic analysis and plotting. Data can be entered from the terminal or from previously created data files. The data editing portion of the program is used to inspect and modify data and to statistically identify outliers. The manual analysis component is used to test the assumptions necessary for parallel line assays using analysis of covariance techniques and to determine potency ratios with confidence limits. The manual plotting component provides a graphic display of the data on the terminal screen or on a standard line printer. The automatic portion runs through multiple analyses without operator input. Data may be saved in a special file to expedite input at a future time.

  12. Parallel processing for digital picture comparison

    NASA Technical Reports Server (NTRS)

    Cheng, H. D.; Kou, L. T.

    1987-01-01

    In picture processing an important problem is to identify two digital pictures of the same scene taken under different lighting conditions. This kind of problem can be found in remote sensing, satellite signal processing and the related areas. The identification can be done by transforming the gray levels so that the gray level histograms of the two pictures are closely matched. The transformation problem can be solved by using the packing method. Researchers propose a VLSI architecture consisting of m x n processing elements with extensive parallel and pipelining computation capabilities to speed up the transformation with the time complexity 0(max(m,n)), where m and n are the numbers of the gray levels of the input picture and the reference picture respectively. If using uniprocessor and a dynamic programming algorithm, the time complexity will be 0(m(3)xn). The algorithm partition problem, as an important issue in VLSI design, is discussed. Verification of the proposed architecture is also given.

  13. A brief parallel I/O tutorial.

    SciTech Connect

    Ward, H. Lee

    2010-03-01

    This document provides common best practices for the efficient utilization of parallel file systems for analysts and application developers. A multi-program, parallel supercomputer is able to provide effective compute power by aggregating a host of lower-power processors using a network. The idea, in general, is that one either constructs the application to distribute parts to the different nodes and processors available and then collects the result (a parallel application), or one launches a large number of small jobs, each doing similar work on different subsets (a campaign). The I/O system on these machines is usually implemented as a tightly-coupled, parallel application itself. It is providing the concept of a 'file' to the host applications. The 'file' is an addressable store of bytes and that address space is global in nature. In essence, it is providing a global address space. Beyond the simple reality that the I/O system is normally composed of a small, less capable, collection of hardware, that concept of a global address space will cause problems if not very carefully utilized. How much of a problem and the ways in which those problems manifest will be different, but that it is problem prone has been well established. Worse, the file system is a shared resource on the machine - a system service. What an application does when it uses the file system impacts all users. It is not the case that some portion of the available resource is reserved. Instead, the I/O system responds to requests by scheduling and queuing based on instantaneous demand. Using the system well contributes to the overall throughput on the machine. From a solely self-centered perspective, using it well reduces the time that the application or campaign is subject to impact by others. The developer's goal should be to accomplish I/O in a way that minimizes interaction with the I/O system, maximizes the amount of data moved per call, and provides the I/O system the most information about

  14. A generalized parallel replica dynamics

    SciTech Connect

    Binder, Andrew; Lelièvre, Tony; Simpson, Gideon

    2015-03-01

    Metastability is a common obstacle to performing long molecular dynamics simulations. Many numerical methods have been proposed to overcome it. One method is parallel replica dynamics, which relies on the rapid convergence of the underlying stochastic process to a quasi-stationary distribution. Two requirements for applying parallel replica dynamics are knowledge of the time scale on which the process converges to the quasi-stationary distribution and a mechanism for generating samples from this distribution. By combining a Fleming–Viot particle system with convergence diagnostics to simultaneously identify when the process converges while also generating samples, we can address both points. This variation on the algorithm is illustrated with various numerical examples, including those with entropic barriers and the 2D Lennard-Jones cluster of seven atoms.

  15. Scans as primitive parallel operations

    SciTech Connect

    Blelloch, G.E. . Dept. of Computer Science)

    1989-11-01

    In most parallel random access machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can execute in no more time than these parallel memory references. This paper outlines an extensive study of the effect of including, in the PRAM models, such scan operations as unit-time primitives. The study concludes that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. The authors argue that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. This paper describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design. These all run on an EREW PRAM with the addition of two scan primitives.

  16. Parallel multiplex laser feedback interferometry

    SciTech Connect

    Zhang, Song; Tan, Yidong; Zhang, Shulian

    2013-12-15

    We present a parallel multiplex laser feedback interferometer based on spatial multiplexing which avoids the signal crosstalk in the former feedback interferometer. The interferometer outputs two close parallel laser beams, whose frequencies are shifted by two acousto-optic modulators by 2Ω simultaneously. A static reference mirror is inserted into one of the optical paths as the reference optical path. The other beam impinges on the target as the measurement optical path. Phase variations of the two feedback laser beams are simultaneously measured through heterodyne demodulation with two different detectors. Their subtraction accurately reflects the target displacement. Under typical room conditions, experimental results show a resolution of 1.6 nm and accuracy of 7.8 nm within the range of 100 μm.

  17. Parallel processing spacecraft communication system

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary S. (Inventor); Donaldson, James A. (Inventor); Luong, Huy H. (Inventor); Wood, Steven H. (Inventor)

    1998-01-01

    An uplink controlling assembly speeds data processing using a special parallel codeblock technique. A correct start sequence initiates processing of a frame. Two possible start sequences can be used; and the one which is used determines whether data polarity is inverted or non-inverted. Processing continues until uncorrectable errors are found. The frame ends by intentionally sending a block with an uncorrectable error. Each of the codeblocks in the frame has a channel ID. Each channel ID can be separately processed in parallel. This obviates the problem of waiting for error correction processing. If that channel number is zero, however, it indicates that the frame of data represents a critical command only. That data is handled in a special way, independent of the software. Otherwise, the processed data further handled using special double buffering techniques to avoid problems from overrun. When overrun does occur, the system takes action to lose only the oldest data.

  18. Parallel Power Grid Simulation Toolkit

    SciTech Connect

    Smith, Steve; Kelley, Brian; Banks, Lawrence; Top, Philip; Woodward, Carol

    2015-09-14

    ParGrid is a 'wrapper' that integrates a coupled Power Grid Simulation toolkit consisting of a library to manage the synchronization and communication of independent simulations. The included library code in ParGid, named FSKIT, is intended to support the coupling multiple continuous and discrete even parallel simulations. The code is designed using modern object oriented C++ methods utilizing C++11 and current Boost libraries to ensure compatibility with multiple operating systems and environments.

  19. Parallel fabrication of nanogap electrodes.

    PubMed

    Johnston, Danvers E; Strachan, Douglas R; Johnson, A T Charlie

    2007-09-01

    We have developed a technique for simultaneously fabricating large numbers of nanogaps in a single processing step using feedback-controlled electromigration. Parallel nanogap formation is achieved by a balanced simultaneous process that uses a novel arrangement of nanoscale shorts between narrow constrictions where the nanogaps form. Because of this balancing, the fabrication of multiple nanoelectrodes is similar to that of a single nanogap junction. The technique should be useful for constructing complex circuits of molecular-scale electronic devices.

  20. Experiences from Participants in Large-Scale Group Practice of the Maharishi Transcendental Meditation and TM-Sidhi Programs and Parallel Principles of Quantum Theory, Astrophysics, Quantum Cosmology, and String Theory: Interdisciplinary Qualitative Correspondences

    NASA Astrophysics Data System (ADS)

    Svenson, Eric Johan

    Participants on the Invincible America Assembly in Fairfield, Iowa, and neighboring Maharishi Vedic City, Iowa, practicing Maharishi Transcendental Meditation(TM) (TM) and the TM-Sidhi(TM) programs in large groups, submitted written experiences that they had had during, and in some cases shortly after, their daily practice of the TM and TM-Sidhi programs. Participants were instructed to include in their written experiences only what they observed and to leave out interpretation and analysis. These experiences were then read by the author and compared with principles and phenomena of modern physics, particularly with quantum theory, astrophysics, quantum cosmology, and string theory as well as defining characteristics of higher states of consciousness as described by Maharishi Vedic Science. In all cases, particular principles or phenomena of physics and qualities of higher states of consciousness appeared qualitatively quite similar to the content of the given experience. These experiences are presented in an Appendix, in which the corresponding principles and phenomena of physics are also presented. These physics "commentaries" on the experiences were written largely in layman's terms, without equations, and, in nearly every case, with clear reference to the corresponding sections of the experiences to which a given principle appears to relate. An abundance of similarities were apparent between the subjective experiences during meditation and principles of modern physics. A theoretic framework for understanding these rich similarities may begin with Maharishi's theory of higher states of consciousness provided herein. We conclude that the consistency and richness of detail found in these abundant similarities warrants the further pursuit and development of such a framework.