distributed memory message-passing: Topics by Science.gov

Sample records for distributed memory message-passing

Combining Distributed and Shared Memory Models: Approach and Evolution of the Global Arrays Toolkit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nieplocha, Jarek; Harrison, Robert J.; Kumar, Mukul

2002-07-29

Both shared memory and distributed memory models have advantages and shortcomings. Shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in the modern computers this characteristic might have a negative impact on performance and scalability. Various techniques, such as code restructuring to increase data reuse and introducing blocking in data accesses, can address the problem and yield performance competitive with message passing[Singh], however at the cost of compromising the ease of use feature. Distributed memory models such as message passing or one-sided communication offer performance and scalability butmore » they compromise the ease-of-use. In this context, the message-passing model is sometimes referred to as?assembly programming for the scientific computing?. The Global Arrays toolkit[GA1, GA2] attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed explicitly by the programmer. This management is achieved by explicit calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be explicitly specified and hence managed. The GA model exposes to the programmer the hierarchical memory of modern high-performance computer systems, and by recognizing the communication overhead for remote data transfer, it promotes data reuse and locality of reference. This paper describes the characteristics of the Global Arrays programming model, capabilities of the toolkit, and discusses its evolution.« less
A message passing kernel for the hypercluster parallel processing test bed

NASA Technical Reports Server (NTRS)

Blech, Richard A.; Quealy, Angela; Cole, Gary L.

1989-01-01

A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP.
CSlib, a library to couple codes via Client/Server messaging

DOE Office of Scientific and Technical Information (OSTI.GOV)

Plimpton, Steve

The CSlib is a small, portable library which enables two (or more) independent simulation codes to be coupled, by exchanging messages with each other. Both codes link to the library when they are built, and can them communicate with each other as they run. The messages contain data or instructions that the two codes send back-and-forth to each other. The messaging can take place via files, sockets, or MPI. The latter is a standard distributed-memory message-passing library.
Experiences Using OpenMP Based on Compiler Directed Software DSM on a PC Cluster

NASA Technical Reports Server (NTRS)

Hess, Matthias; Jost, Gabriele; Mueller, Matthias; Ruehle, Roland; Biegel, Bryan (Technical Monitor)

2002-01-01

In this work we report on our experiences running OpenMP (message passing) programs on a commodity cluster of PCs (personal computers) running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS (NASA Advanced Supercomputing) Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.
Parallelization of KENO-Va Monte Carlo code

NASA Astrophysics Data System (ADS)

Ramón, Javier; Peña, Jorge

1995-07-01

KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo Method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code: one for shared memory machines and other for distributed memory systems using the message-passing interface PVM have been generated. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. A FDDI network of 6 HP9000/735 was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
MPF: A portable message passing facility for shared memory multiprocessors

NASA Technical Reports Server (NTRS)

Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.

1987-01-01

The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications, linear systems solution, and iterative solution of partial differential equations.
Supporting shared data structures on distributed memory architectures

NASA Technical Reports Server (NTRS)

Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John

1990-01-01

Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the NCUBE/7 and IPSC/2 hypercubes are described.
Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory

NASA Technical Reports Server (NTRS)

Janssens, Bob; Fuchs, W. Kent

1994-01-01

Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfer of blocks of data, resulting in reduced dependency tracking overhead and reduced potential for rollback propagation. We develop an ownership timestamp scheme to tolerate the loss of block state information and develop a passive server model of execution where interactions between processors are considered atomic. With our scheme, dependencies are significantly reduced compared to the traditional message-passing model.
Users manual for the Chameleon parallel programming tools

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gropp, W.; Smith, B.

1993-06-01

Message passing is a common method for writing programs for distributed-memory parallel computers. Unfortunately, the lack of a standard for message passing has hampered the construction of portable and efficient parallel programs. In an attempt to remedy this problem, a number of groups have developed their own message-passing systems, each with its own strengths and weaknesses. Chameleon is a second-generation system of this type. Rather than replacing these existing systems, Chameleon is meant to supplement them by providing a uniform way to access many of these systems. Chameleon`s goals are to (a) be very lightweight (low over-head), (b) be highlymore » portable, and (c) help standardize program startup and the use of emerging message-passing operations such as collective operations on subsets of processors. Chameleon also provides a way to port programs written using PICL or Intel NX message passing to other systems, including collections of workstations. Chameleon is tracking the Message-Passing Interface (MPI) draft standard and will provide both an MPI implementation and an MPI transport layer. Chameleon provides support for heterogeneous computing by using p4 and PVM. Chameleon`s support for homogeneous computing includes the portable libraries p4, PICL, and PVM and vendor-specific implementation for Intel NX, IBM EUI (SP-1), and Thinking Machines CMMD (CM-5). Support for Ncube and PVM 3.x is also under development.« less
Cooperative Data Sharing: Simple Support for Clusters of SMP Nodes

NASA Technical Reports Server (NTRS)

DiNucci, David C.; Balley, David H. (Technical Monitor)

1997-01-01

Libraries like PVM and MPI send typed messages to allow for heterogeneous cluster computing. Lower-level libraries, such as GAM, provide more efficient access to communication by removing the need to copy messages between the interface and user space in some cases. still lower-level interfaces, such as UNET, get right down to the hardware level to provide maximum performance. However, these are all still interfaces for passing messages from one process to another, and have limited utility in a shared-memory environment, due primarily to the fact that message passing is just another term for copying. This drawback is made more pertinent by today's hybrid architectures (e.g. clusters of SMPs), where it is difficult to know beforehand whether two communicating processes will share memory. As a result, even portable language tools (like HPF compilers) must either map all interprocess communication, into message passing with the accompanying performance degradation in shared memory environments, or they must check each communication at run-time and implement the shared-memory case separately for efficiency. Cooperative Data Sharing (CDS) is a single user-level API which abstracts all communication between processes into the sharing and access coordination of memory regions, in a model which might be described as "distributed shared messages" or "large-grain distributed shared memory". As a result, the user programs to a simple latency-tolerant abstract communication specification which can be mapped efficiently to either a shared-memory or message-passing based run-time system, depending upon the available architecture. Unlike some distributed shared memory interfaces, the user still has complete control over the assignment of data to processors, the forwarding of data to its next likely destination, and the queuing of data until it is needed, so even the relatively high latency present in clusters can be accomodated. CDS does not require special use of an MMU, which can add overhead to some DSM systems, and does not require an SPMD programming model. unlike some message-passing interfaces, CDS allows the user to implement efficient demand-driven applications where processes must "fight" over data, and does not perform copying if processes share memory and do not attempt concurrent writes. CDS also supports heterogeneous computing, dynamic process creation, handlers, and a very simple thread-arbitration mechanism. Additional support for array subsections is currently being considered. The CDS1 API, which forms the kernel of CDS, is built primarily upon only 2 communication primitives, one process initiation primitive, and some data translation (and marshalling) routines, memory allocation routines, and priority control routines. The entire current collection of 28 routines provides enough functionality to implement most (or all) of MPI 1 and 2, which has a much larger interface consisting of hundreds of routines. still, the API is small enough to consider integrating into standard os interfaces for handling inter-process communication in a network-independent way. This approach would also help to solve many of the problems plaguing other higher-level standards such as MPI and PVM which must, in some cases, "play OS" to adequately address progress and process control issues. The CDS2 API, a higher level of interface roughly equivalent in functionality to MPI and to be built entirely upon CDS1, is still being designed. It is intended to add support for the equivalent of communicators, reduction and other collective operations, process topologies, additional support for process creation, and some automatic memory management. CDS2 will not exactly match MPI, because the copy-free semantics of communication from CDS1 will be supported. CDS2 application programs will be free to carefully also use CDS1. CDS1 has been implemented on networks of workstations running unmodified Unix-based operating systems, using UDP/IP and vendor-supplied high- performance locks. Although its inter-node performance is currently unimpressive due to rudimentary implementation technique, it even now outperforms highly-optimized MPI implementation on intra-node communication due to its support for non-copy communication. The similarity of the CDS1 architecture to that of other projects such as UNET and TRAP suggests that the inter-node performance can be increased significantly to surpass MPI or PVM, and it may be possible to migrate some of its functionality to communication controllers.
Distributed memory compiler design for sparse problems

NASA Technical Reports Server (NTRS)

Wu, Janet; Saltz, Joel; Berryman, Harry; Hiranandani, Seema

1991-01-01

A compiler and runtime support mechanism is described and demonstrated. The methods presented are capable of solving a wide range of sparse and unstructured problems in scientific computing. The compiler takes as input a FORTRAN 77 program enhanced with specifications for distributing data, and the compiler outputs a message passing program that runs on a distributed memory computer. The runtime support for this compiler is a library of primitives designed to efficiently support irregular patterns of distributed array accesses and irregular distributed array partitions. A variety of Intel iPSC/860 performance results obtained through the use of this compiler are presented.
Monitoring Data-Structure Evolution in Distributed Message-Passing Programs

NASA Technical Reports Server (NTRS)

Sarukkai, Sekhar R.; Beers, Andrew; Woodrow, Thomas S. (Technical Monitor)

1996-01-01

Monitoring the evolution of data structures in parallel and distributed programs, is critical for debugging its semantics and performance. However, the current state-of-art in tracking and presenting data-structure information on parallel and distributed environments is cumbersome and does not scale. In this paper we present a methodology that automatically tracks memory bindings (not the actual contents) of static and dynamic data-structures of message-passing C programs, using PVM. With the help of a number of examples we show that in addition to determining the impact of memory allocation overheads on program performance, graphical views can help in debugging the semantics of program execution. Scalable animations of virtual address bindings of source-level data-structures are used for debugging the semantics of parallel programs across all processors. In conjunction with light-weight core-files, this technique can be used to complement traditional debuggers on single processors. Detailed information (such as data-structure contents), on specific nodes, can be determined using traditional debuggers after the data structure evolution leading to the semantic error is observed graphically.
Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

NASA Technical Reports Server (NTRS)

VanderWijngaart, Rob F.

2000-01-01

Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining, gather/scatter, and redistribution. At the end of the conversion process most intermediate Charon function calls will have been removed, the non-distributed arrays will have been deleted, and virtually the only remaining Charon functions calls are the high-level, highly optimized communications. Distribution of the data is under complete control of the programmer, although a wide range of useful distributions is easily available through predefined functions. A crucial aspect of the library is that it does not allocate space for distributed arrays, but accepts programmer-specified memory. This has two major consequences. First, codes parallelized using Charon do not suffer from encapsulation; user data is always directly accessible. This provides high efficiency, and also retains the possibility of using message passing directly for highly irregular communications. Second, non-distributed arrays can be interpreted as (trivial) distributions in the Charon sense, which allows them to be mapped to truly distributed arrays, and vice versa. This is the mechanism that enables incremental parallelization. In this paper we provide a brief introduction of the library and then focus on the actual steps in the parallelization process, using some representative examples from, among others, the NAS Parallel Benchmarks. We show how a complicated two-dimensional pipeline-the prototypical non-data-parallel algorithm- can be constructed with ease. To demonstrate the flexibility of the library, we give examples of the stepwise, efficient parallel implementation of nonlocal boundary conditions common in aircraft simulations, as well as the construction of the sequence of grids required for multigrid.
Vascular system modeling in parallel environment - distributed and shared memory approaches

PubMed Central

Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

2011-01-01

The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Compiling global name-space programs for distributed execution

NASA Technical Reports Server (NTRS)

Koelbel, Charles; Mehrotra, Piyush

1990-01-01

Distributed memory machines do not provide hardware support for a global address space. Thus programmers are forced to partition the data across the memories of the architecture and use explicit message passing to communicate data between processors. The compiler support required to allow programmers to express their algorithms using a global name-space is examined. A general method is presented for analysis of a high level source program and along with its translation to a set of independently executing tasks communicating via messages. If the compiler has enough information, this translation can be carried out at compile-time. Otherwise run-time code is generated to implement the required data movement. The analysis required in both situations is described and the performance of the generated code on the Intel iPSC/2 is presented.
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Rizk, Yehia M.

1999-01-01

The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable on distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of the "Gauss-Sidel" fashion. The programming efforts for the _MPI code is more complicated than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each group are approximately balanced. A proper number of threads are initially allocated to each group, and in subsequent iterations during the run-time, the number of threads are adjusted to achieve load balancing across the processes. Each process exploits the multitasking directives already established in Overflow.
Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)

2002-01-01

The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA based implementations of selected CFD application benchmark kernels and compare them to corresponding message passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames, the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.
Distributed-Memory Computing With the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)

NASA Technical Reports Server (NTRS)

Riley, Christopher J.; Cheatwood, F. McNeil

1997-01-01

The Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA), a Navier-Stokes solver, has been modified for use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard. A standard domain decomposition strategy is used in which the computational domain is divided into subdomains with each subdomain assigned to a processor. Performance is examined on dedicated parallel machines and a network of desktop workstations. The effect of domain decomposition and frequency of boundary updates on performance and convergence is also examined for several realistic configurations and conditions typical of large-scale computational fluid dynamic analysis.
New NAS Parallel Benchmarks Results

NASA Technical Reports Server (NTRS)

Yarrow, Maurice; Saphir, William; VanderWijngaart, Rob; Woo, Alex; Kutler, Paul (Technical Monitor)

1997-01-01

NPB2 (NAS (NASA Advanced Supercomputing) Parallel Benchmarks 2) is an implementation, based on Fortran and the MPI (message passing interface) message passing standard, of the original NAS Parallel Benchmark specifications. NPB2 programs are run with little or no tuning, in contrast to NPB vendor implementations, which are highly optimized for specific architectures. NPB2 results complement, rather than replace, NPB results. Because they have not been optimized by vendors, NPB2 implementations approximate the performance a typical user can expect for a portable parallel program on distributed memory parallel computers. Together these results provide an insightful comparison of the real-world performance of high-performance computers. New NPB2 features: New implementation (CG), new workstation class problem sizes, new serial sample versions, more performance statistics.
The Raid distributed database system

NASA Technical Reports Server (NTRS)

Bhargava, Bharat; Riedl, John

1989-01-01

Raid, a robust and adaptable distributed database system for transaction processing (TP), is described. Raid is a message-passing system, with server processes on each site to manage concurrent processing, consistent replicated copies during site failures, and atomic distributed commitment. A high-level layered communications package provides a clean location-independent interface between servers. The latest design of the package delivers messages via shared memory in a configuration with several servers linked into a single process. Raid provides the infrastructure to investigate various methods for supporting reliable distributed TP. Measurements on TP and server CPU time are presented, along with data from experiments on communications software, consistent replicated copy control during site failures, and concurrent distributed checkpointing. A software tool for evaluating the implementation of TP algorithms in an operating-system kernel is proposed.

Parallel and fault-tolerant algorithms for hypercube multiprocessors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aykanat, C.

1988-01-01

Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multi-processor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube connected message-passing multi-processor. Significant performance improvement is achieved by using these techniques. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that hypercube topology is scalable for an FE class of problem. The SCG algorithm is also shown to be suitable for vectorization, and near supercomputer performance is achieved on a vectormore » hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.« less
NAS Parallel Benchmark. Results 11-96: Performance Comparison of HPF and MPI Based NAS Parallel Benchmarks. 1.0

NASA Technical Reports Server (NTRS)

Saini, Subash; Bailey, David; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

High Performance Fortran (HPF), the high-level language for parallel Fortran programming, is based on Fortran 90. HALF was defined by an informal standards committee known as the High Performance Fortran Forum (HPFF) in 1993, and modeled on TMC's CM Fortran language. Several HPF features have since been incorporated into the draft ANSI/ISO Fortran 95, the next formal revision of the Fortran standard. HPF allows users to write a single parallel program that can execute on a serial machine, a shared-memory parallel machine, or a distributed-memory parallel machine. HPF eliminates the complex, error-prone task of explicitly specifying how, where, and when to pass messages between processors on distributed-memory machines, or when to synchronize processors on shared-memory machines. HPF is designed in a way that allows the programmer to code an application at a high level, and then selectively optimize portions of the code by dropping into message-passing or calling tuned library routines as 'extrinsics'. Compilers supporting High Performance Fortran features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR) Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP/2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/ programming model (HPF and MPI (message passing interface)) combinations will be compared, based on latest NAS (NASA Advanced Supercomputing) Parallel Benchmark (NPB) results, thus providing a cross-machine and cross-model comparison. Specifically, HPF based NPB results will be compared with MPI based NPB results to provide perspective on performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition we would also present NPB (Version 1.0) performance results for the following systems: DEC Alpha Server 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz) NEC SX-4/32, SGI/CRAY T3E, SGI Origin2000.
GENASIS Basics: Object-oriented utilitarian functionality for large-scale physics simulations (Version 2)

NASA Astrophysics Data System (ADS)

Cardall, Christian Y.; Budiardja, Reuben D.

2017-05-01

GenASiS Basics provides Fortran 2003 classes furnishing extensible object-oriented utilitarian functionality for large-scale physics simulations on distributed memory supercomputers. This functionality includes physical units and constants; display to the screen or standard output device; message passing; I/O to disk; and runtime parameter management and usage statistics. This revision -Version 2 of Basics - makes mostly minor additions to functionality and includes some simplifying name changes.
Low Latency Messages on Distributed Memory Multiprocessors

DOE PAGES

Rosing, Matt; Saltz, Joel

1995-01-01

This article describes many of the issues in developing an efficient interface for communication on distributed memory machines. Although the hardware component of message latency is less than 1 ws on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 μs. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. This article describes several tests performed and many of the issues involvedmore » in supporting low latency messages on distributed memory machines.« less
A Multi-Level Parallelization Concept for High-Fidelity Multi-Block Solvers

NASA Technical Reports Server (NTRS)

Hatay, Ferhat F.; Jespersen, Dennis C.; Guruswamy, Guru P.; Rizk, Yehia M.; Byun, Chansup; Gee, Ken; VanDalsem, William R. (Technical Monitor)

1997-01-01

The integration of high-fidelity Computational Fluid Dynamics (CFD) analysis tools with the industrial design process benefits greatly from the robust implementations that are transportable across a wide range of computer architectures. In the present work, a hybrid domain-decomposition and parallelization concept was developed and implemented into the widely-used NASA multi-block Computational Fluid Dynamics (CFD) packages implemented in ENSAERO and OVERFLOW. The new parallel solver concept, PENS (Parallel Euler Navier-Stokes Solver), employs both fine and coarse granularity in data partitioning as well as data coalescing to obtain the desired load-balance characteristics on the available computer platforms. This multi-level parallelism implementation itself introduces no changes to the numerical results, hence the original fidelity of the packages are identically preserved. The present implementation uses the Message Passing Interface (MPI) library for interprocessor message passing and memory accessing. By choosing an appropriate combination of the available partitioning and coalescing capabilities only during the execution stage, the PENS solver becomes adaptable to different computer architectures from shared-memory to distributed-memory platforms with varying degrees of parallelism. The PENS implementation on the IBM SP2 distributed memory environment at the NASA Ames Research Center obtains 85 percent scalable parallel performance using fine-grain partitioning of single-block CFD domains using up to 128 wide computational nodes. Multi-block CFD simulations of complete aircraft simulations achieve 75 percent perfect load-balanced executions using data coalescing and the two levels of parallelism. SGI PowerChallenge, SGI Origin 2000, and a cluster of workstations are the other platforms where the robustness of the implementation is tested. The performance behavior on the other computer platforms with a variety of realistic problems will be included as this on-going study progresses.
Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems.

PubMed

Shehzad, Danish; Bozkuş, Zeki

2016-01-01

Increase in complexity of neuronal network models escalated the efforts to make NEURON simulation environment efficient. The computational neuroscientists divided the equations into subnets amongst multiple processors for achieving better hardware performance. On parallel machines for neuronal networks, interprocessor spikes exchange consumes large section of overall simulation time. In NEURON for communication between processors Message Passing Interface (MPI) is used. MPI_Allgather collective is exercised for spikes exchange after each interval across distributed memory systems. The increase in number of processors though results in achieving concurrency and better performance but it inversely affects MPI_Allgather which increases communication time between processors. This necessitates improving communication methodology to decrease the spikes exchange time over distributed memory systems. This work has improved MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication, and use of recursive doubling mechanism facilitates achieving efficient communication between the processors in precise steps. This approach enhanced communication concurrency and has improved overall runtime making NEURON more efficient for simulation of large neuronal network models.
Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems

PubMed Central

Bozkuş, Zeki

2016-01-01

Increase in complexity of neuronal network models escalated the efforts to make NEURON simulation environment efficient. The computational neuroscientists divided the equations into subnets amongst multiple processors for achieving better hardware performance. On parallel machines for neuronal networks, interprocessor spikes exchange consumes large section of overall simulation time. In NEURON for communication between processors Message Passing Interface (MPI) is used. MPI_Allgather collective is exercised for spikes exchange after each interval across distributed memory systems. The increase in number of processors though results in achieving concurrency and better performance but it inversely affects MPI_Allgather which increases communication time between processors. This necessitates improving communication methodology to decrease the spikes exchange time over distributed memory systems. This work has improved MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication, and use of recursive doubling mechanism facilitates achieving efficient communication between the processors in precise steps. This approach enhanced communication concurrency and has improved overall runtime making NEURON more efficient for simulation of large neuronal network models. PMID:27413363
Experiences using OpenMP based on Computer Directed Software DSM on a PC Cluster

NASA Technical Reports Server (NTRS)

Hess, Matthias; Jost, Gabriele; Mueller, Matthias; Ruehle, Roland

2003-01-01

In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS Parallel Benchmarks that have been automaticaly parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.
Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

NASA Technical Reports Server (NTRS)

Lawson, Gary; Poteat, Michael; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

2016-01-01

In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23X was measured for MPI+SMPI, but only 10X was measured for MPI+OpenMP.
DMA engine for repeating communication patterns

DOEpatents

Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Steinmacher-Burow, Burkhard; Vranas, Pavlos

2010-09-21

A parallel computer system is constructed as a network of interconnected compute nodes to operate a global message-passing application for performing communications across the network. Each of the compute nodes includes one or more individual processors with memories which run local instances of the global message-passing application operating at each compute node to carry out local processing operations independent of processing operations carried out at other compute nodes. Each compute node also includes a DMA engine constructed to interact with the application via Injection FIFO Metadata describing multiple Injection FIFOs where each Injection FIFO may containing an arbitrary number of message descriptors in order to process messages with a fixed processing overhead irrespective of the number of message descriptors included in the Injection FIFO.
A multiarchitecture parallel-processing development environment

NASA Technical Reports Server (NTRS)

Townsend, Scott; Blech, Richard; Cole, Gary

1993-01-01

A description is given of the hardware and software of a multiprocessor test bed - the second generation Hypercluster system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multiprocessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing concepts such as message passing algorithms, debugging tools, and computational 'steering'. Such research would be difficult, if not impossible, to achieve on shared, commercial systems.
Verification of Faulty Message Passing Systems with Continuous State Space in PVS

NASA Technical Reports Server (NTRS)

Pilotto, Concetta; White, Jerome

2010-01-01

We present a library of Prototype Verification System (PVS) meta-theories that verifies a class of distributed systems in which agent commu nication is through message-passing. The theoretic work, outlined in, consists of iterative schemes for solving systems of linear equations , such as message-passing extensions of the Gauss and Gauss-Seidel me thods. We briefly review that work and discuss the challenges in formally verifying it.
Efficient iteration in data-parallel programs with irregular and dynamically distributed data structures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Littlefield, R.J.

1990-02-01

To implement an efficient data-parallel program on a non-shared memory MIMD multicomputer, data and computations must be properly partitioned to achieve good load balance and locality of reference. Programs with irregular data reference patterns often require irregular partitions. Although good partitions may be easy to determine, they can be difficult or impossible to implement in programming languages that provide only regular data distributions, such as blocked or cyclic arrays. We are developing Onyx, a programming system that provides a shared memory model of distributed data structures and extends the concept of data distribution to include irregular and dynamic distributions. Thismore » provides a powerful means to specify irregular partitions. Perhaps surprisingly, programs using it can also execute efficiently. In this paper, we describe and evaluate the Onyx implementation of a model problem that repeatedly executes an irregular but fixed data reference pattern. On an NCUBE hypercube, the speed of the Onyx implementation is comparable to that of carefully handwritten message-passing code.« less
Streaming data analytics via message passing with application to graph algorithms

DOE PAGES

Plimpton, Steven J.; Shead, Tim

2014-05-06

The need to process streaming data, which arrives continuously at high-volume in real-time, arises in a variety of contexts including data produced by experiments, collections of environmental or network sensors, and running simulations. Streaming data can also be formulated as queries or transactions which operate on a large dynamic data store, e.g. a distributed database. We describe a lightweight, portable framework named PHISH which enables a set of independent processes to compute on a stream of data in a distributed-memory parallel manner. Datums are routed between processes in patterns defined by the application. PHISH can run on top of eithermore » message-passing via MPI or sockets via ZMQ. The former means streaming computations can be run on any parallel machine which supports MPI; the latter allows them to run on a heterogeneous, geographically dispersed network of machines. We illustrate how PHISH can support streaming MapReduce operations, and describe streaming versions of three algorithms for large, sparse graph analytics: triangle enumeration, subgraph isomorphism matching, and connected component finding. Lastly, we also provide benchmark timings for MPI versus socket performance of several kernel operations useful in streaming algorithms.« less
Parallel, Distributed Scripting with Python

DOE Office of Scientific and Technical Information (OSTI.GOV)

Miller, P J

2002-05-24

Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI librarymore » gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc ... These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider the a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.« less
Distributed parallel messaging for multiprocessor systems

DOEpatents

Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burrow, Burhard; Sugawara, Yutaka

2013-06-04

A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit, includes switch interface for reading from the memory system when injecting packets into the network.
Communication Studies of DMP and SMP Machines

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Biswas, Rupak; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

Understanding the interplay between machines and problems is key to obtaining high performance on parallel machines. This paper investigates the interplay between programming paradigms and communication capabilities of parallel machines. In particular, we explicate the communication capabilities of the IBM SP-2 distributed-memory multiprocessor and the SGI PowerCHALLENGEarray symmetric multiprocessor. Two benchmark problems of bitonic sorting and Fast Fourier Transform are selected for experiments. Communication-efficient algorithms are developed to exploit the overlapping capabilities of the machines. Programs are written in Message-Passing Interface for portability and identical codes are used for both machines. Various data sizes and message sizes are used to test the machines' communication capabilities. Experimental results indicate that the communication performance of the multiprocessors are consistent with the size of messages. The SP-2 is sensitive to message size but yields a much higher communication overlapping because of the communication co-processor. The PowerCHALLENGEarray is not highly sensitive to message size and yields a low communication overlapping. Bitonic sorting yields lower performance compared to FFT due to a smaller computation-to-communication ratio.
Efficiently passing messages in distributed spiking neural network simulation.

PubMed

Thibeault, Corey M; Minkovich, Kirill; O'Brien, Michael J; Harris, Frederick C; Srinivasa, Narayan

2013-01-01

Efficiently passing spiking messages in a neural model is an important aspect of high-performance simulation. As the scale of networks has increased so has the size of the computing systems required to simulate them. In addition, the information exchange of these resources has become more of an impediment to performance. In this paper we explore spike message passing using different mechanisms provided by the Message Passing Interface (MPI). A specific implementation, MVAPICH, designed for high-performance clusters with Infiniband hardware is employed. The focus is on providing information about these mechanisms for users of commodity high-performance spiking simulators. In addition, a novel hybrid method for spike exchange was implemented and benchmarked.
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors

NASA Technical Reports Server (NTRS)

Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

1998-01-01

This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.
Parallel Navier-Stokes computations on shared and distributed memory architectures

NASA Technical Reports Server (NTRS)

Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar

1995-01-01

We study a high order finite difference scheme to solve the time accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SPI). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.

SKIRT: Hybrid parallelization of radiative transfer simulations

NASA Astrophysics Data System (ADS)

Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.

2017-07-01

We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs

NASA Technical Reports Server (NTRS)

Yan, Jerry C.; Sarukkai, Sekhar R.; Mehra, Pankaj; Lum, Henry, Jr. (Technical Monitor)

1994-01-01

This paper presents a methodology for debugging the performance of message-passing programs on both tightly coupled and loosely coupled distributed-memory machines. The AIMS (Automated Instrumentation and Monitoring System) toolkit, a suite of software tools for measurement and analysis of performance, is introduced and its application illustrated using several benchmark programs drawn from the field of computational fluid dynamics. AIMS includes (i) Xinstrument, a powerful source-code instrumentor, which supports both Fortran77 and C as well as a number of different message-passing libraries including Intel's NX Thinking Machines' CMMD, and PVM; (ii) Monitor, a library of timestamping and trace -collection routines that run on supercomputers (such as Intel's iPSC/860, Delta, and Paragon and Thinking Machines' CM5) as well as on networks of workstations (including Convex Cluster and SparcStations connected by a LAN); (iii) Visualization Kernel, a trace-animation facility that supports source-code clickback, simultaneous visualization of computation and communication patterns, as well as analysis of data movements; (iv) Statistics Kernel, an advanced profiling facility, that associates a variety of performance data with various syntactic components of a parallel program; (v) Index Kernel, a diagnostic tool that helps pinpoint performance bottlenecks through the use of abstract indices; (vi) Modeling Kernel, a facility for automated modeling of message-passing programs that supports both simulation -based and analytical approaches to performance prediction and scalability analysis; (vii) Intrusion Compensator, a utility for recovering true performance from observed performance by removing the overheads of monitoring and their effects on the communication pattern of the program; and (viii) Compatibility Tools, that convert AIMS-generated traces into formats used by other performance-visualization tools, such as ParaGraph, Pablo, and certain AVS/Explorer modules.
Simple Algorithms for Distributed Leader Election in Anonymous Synchronous Rings and Complete Networks Inspired by Neural Development in Fruit Flies.

PubMed

Xu, Lei; Jeavons, Peter

2015-11-01

Leader election in anonymous rings and complete networks is a very practical problem in distributed computing. Previous algorithms for this problem are generally designed for a classical message passing model where complex messages are exchanged. However, the need to send and receive complex messages makes such algorithms less practical for some real applications. We present some simple synchronous algorithms for distributed leader election in anonymous rings and complete networks that are inspired by the development of the neural system of the fruit fly. Our leader election algorithms all assume that only one-bit messages are broadcast by nodes in the network and processors are only able to distinguish between silence and the arrival of one or more messages. These restrictions allow implementations to use a simpler message-passing architecture. Even with these harsh restrictions our algorithms are shown to achieve good time and message complexity both analytically and experimentally.
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].

PubMed

Furuta, Takuya; Sato, Tatsuhiko

2015-01-01

Time-consuming Monte Carlo dose calculation becomes feasible owing to the development of computer technology. However, the recent development is due to emergence of the multi-core high performance computers. Therefore, parallel computing becomes a key to achieve good performance of software programs. A Monte Carlo simulation code PHITS contains two parallel computing functions, the distributed-memory parallelization using protocols of message passing interface (MPI) and the shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose the two functions according to their needs. This paper gives the explanation of the two functions with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.
High Performance Programming Using Explicit Shared Memory Model on Cray T3D1

NASA Technical Reports Server (NTRS)

Simon, Horst D.; Saini, Subhash; Grassi, Charles

1994-01-01

The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors and CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under Cray Research adaptive Fortran (CRAFT) model four programming methods (data parallel, work sharing, message-passing using PVM, and explicit shared memory model) are available to the users. However, at this time data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that the performance of neither standard PVM nor CRI s PVM exploits the hardware capabilities of the T3D. The reasons for the bad performance of PVM as a native message-passing library are presented. This is illustrated by the performance of NAS Parallel Benchmarks (NPB) programmed in explicit shared memory model on Cray T3D. In general, the performance of standard PVM is about 4 to 5 times less than obtained by using explicit shared memory model. This degradation in performance is also seen on CM-5 where the performance of applications using native message-passing library CMMD on CM-5 is also about 4 to 5 times less than using data parallel methods. The issues involved (such as barriers, synchronization, invalidating data cache, aligning data cache etc.) while programming in explicit shared memory model are discussed. Comparative performance of NPB using explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, IBM-SP1, etc. is presented.
Charon Message-Passing Toolkit for Scientific Computations

NASA Technical Reports Server (NTRS)

VanderWijngaart, Rob F.; Yan, Jerry (Technical Monitor)

2000-01-01

Charon is a library, callable from C and Fortran, that aids the conversion of structured-grid legacy codes-such as those used in the numerical computation of fluid flows-into parallel, high- performance codes. Key are functions that define distributed arrays, that map between distributed and non-distributed arrays, and that allow easy specification of common communications on structured grids. The library is based on the widely accepted MPI message passing standard. We present an overview of the functionality of Charon, and some representative results.
A real-time MPEG software decoder using a portable message-passing library

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kwong, Man Kam; Tang, P.T. Peter; Lin, Biquan

1995-12-31

We present a real-time MPEG software decoder that uses message-passing libraries such as MPL, p4 and MPI. The parallel MPEG decoder currently runs on the IBM SP system but can be easil ported to other parallel machines. This paper discusses our parallel MPEG decoding algorithm as well as the parallel programming environment under which it uses. Several technical issues are discussed, including balancing of decoding speed, memory limitation, 1/0 capacities, and optimization of MPEG decoding components. This project shows that a real-time portable software MPEG decoder is feasible in a general-purpose parallel machine.
Message Passing vs. Shared Address Space on a Cluster of SMPs

NASA Technical Reports Server (NTRS)

Shan, Hongzhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak

2000-01-01

The convergence of scalable computer architectures using clusters of PCs (or PC-SMPs) with commodity networking has become an attractive platform for high end scientific computing. Currently, message-passing and shared address space (SAS) are the two leading programming paradigms for these systems. Message-passing has been standardized with MPI, and is the most common and mature programming approach. However message-passing code development can be extremely difficult, especially for irregular structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality, and high protocol overhead. In this paper, we compare the performance of and programming effort, required for six applications under both programming models on a 32 CPU PC-SMP cluster. Our application suite consists of codes that typically do not exhibit high efficiency under shared memory programming. due to their high communication to computation ratios and complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications: however, on certain classes of problems SAS performance is competitive with MPI. We also present new algorithms for improving the PC cluster performance of MPI collective operations.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations

NASA Technical Reports Server (NTRS)

Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

1996-01-01

We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms

NASA Technical Reports Server (NTRS)

Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

1997-01-01

We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
HYDRA : High-speed simulation architecture for precision spacecraft formation simulation

NASA Technical Reports Server (NTRS)

Martin, Bryan J.; Sohl, Garett.

2003-01-01

e Hierarchical Distributed Reconfigurable Architecture- is a scalable simulation architecture that provides flexibility and ease-of-use which take advantage of modern computation and communication hardware. It also provides the ability to implement distributed - or workstation - based simulations and high-fidelity real-time simulation from a common core. Originally designed to serve as a research platform for examining fundamental challenges in formation flying simulation for future space missions, it is also finding use in other missions and applications, all of which can take advantage of the underlying Object-Oriented structure to easily produce distributed simulations. Hydra automates the process of connecting disparate simulation components (Hydra Clients) through a client server architecture that uses high-level descriptions of data associated with each client to find and forge desirable connections (Hydra Services) at run time. Services communicate through the use of Connectors, which abstract messaging to provide single-interface access to any desired communication protocol, such as from shared-memory message passing to TCP/IP to ACE and COBRA. Hydra shares many features with the HLA, although providing more flexibility in connectivity services and behavior overriding.
PRAIS: Distributed, real-time knowledge-based systems made easy

NASA Technical Reports Server (NTRS)

Goldstein, David G.

1990-01-01

This paper discusses an architecture for real-time, distributed (parallel) knowledge-based systems called the Parallel Real-time Artificial Intelligence System (PRAIS). PRAIS strives for transparently parallelizing production (rule-based) systems, even when under real-time constraints. PRAIS accomplishes these goals by incorporating a dynamic task scheduler, operating system extensions for fact handling, and message-passing among multiple copies of CLIPS executing on a virtual blackboard. This distributed knowledge-based system tool uses the portability of CLIPS and common message-passing protocols to operate over a heterogeneous network of processors.
Parallel definition of tear film maps on distributed-memory clusters for the support of dry eye diagnosis.

PubMed

González-Domínguez, Jorge; Remeseiro, Beatriz; Martín, María J

2017-02-01

The analysis of the interference patterns on the tear film lipid layer is a useful clinical test to diagnose dry eye syndrome. This task can be automated with a high degree of accuracy by means of the use of tear film maps. However, the time required by the existing applications to generate them prevents a wider acceptance of this method by medical experts. Multithreading has been previously successfully employed by the authors to accelerate the tear film map definition on multicore single-node machines. In this work, we propose a hybrid message-passing and multithreading parallel approach that further accelerates the generation of tear film maps by exploiting the computational capabilities of distributed-memory systems such as multicore clusters and supercomputers. The algorithm for drawing tear film maps is parallelized using Message Passing Interface (MPI) for inter-node communications and the multithreading support available in the C++11 standard for intra-node parallelization. The original algorithm is modified to reduce the communications and increase the scalability. The hybrid method has been tested on 32 nodes of an Intel cluster (with two 12-core Haswell 2680v3 processors per node) using 50 representative images. Results show that maximum runtime is reduced from almost two minutes using the previous only-multithreaded approach to less than ten seconds using the hybrid method. The hybrid MPI/multithreaded implementation can be used by medical experts to obtain tear film maps in only a few seconds, which will significantly accelerate and facilitate the diagnosis of the dry eye syndrome. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Preconditioned implicit solvers for the Navier-Stokes equations on distributed-memory machines

NASA Technical Reports Server (NTRS)

Ajmani, Kumud; Liou, Meng-Sing; Dyson, Rodger W.

1994-01-01

The GMRES method is parallelized, and combined with local preconditioning to construct an implicit parallel solver to obtain steady-state solutions for the Navier-Stokes equations of fluid flow on distributed-memory machines. The new implicit parallel solver is designed to preserve the convergence rate of the equivalent 'serial' solver. A static domain-decomposition is used to partition the computational domain amongst the available processing nodes of the parallel machine. The SPMD (Single-Program Multiple-Data) programming model is combined with message-passing tools to develop the parallel code on a 32-node Intel Hypercube and a 512-node Intel Delta machine. The implicit parallel solver is validated for internal and external flow problems, and is found to compare identically with flow solutions obtained on a Cray Y-MP/8. A peak computational speed of 2300 MFlops/sec has been achieved on 512 nodes of the Intel Delta machine,k for a problem size of 1024 K equations (256 K grid points).
Testing New Programming Paradigms with NAS Parallel Benchmarks

NASA Technical Reports Server (NTRS)

Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

2000-01-01

Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developing to aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. Recent years new effort has been made in defining new parallel programming paradigms. The best examples are: HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplify the tedious tasks encountered in writing message passing programs. HPF is independent of memory hierarchy, however, due to the immaturity of compiler technology its performance is still questionable. Although use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promisses portability in a heterogeneous environment and offers possibility to "compile once and run anywhere." In light of testing these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java-threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage was applied to several benchmarks, noticeably BT and SP, resulting in better sequential performance. In order to overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as lack of supporting the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outer-most loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size in the figure in next page. These results were obtained on an SGI Origin2000 (195MHz) with MIPSpro-f77 compiler 7.2.1 for OpenMP and MPI codes and PGI pghpf-2.4.3 compiler with MPI interface for HPF programs.
Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

2002-01-01

The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(O) preconditioned CG (PCG) using different programming paradigms and architectures. Results show that for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems, that cache reuse may be more important than reducing communication, that it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution, and that a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gains. A implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread level parallelism.
Low latency messages on distributed memory multiprocessors

NASA Technical Reports Server (NTRS)

Rosing, Matthew; Saltz, Joel

1993-01-01

Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited for efficiently handling asynchronous communications compared to a single processor system. The ability to send data or procedure is very flexible for minimizing message latency, based on the type of communication being performed. The test performed and the proposed interface are described.
PANDA: A distributed multiprocessor operating system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chubb, P.

1989-01-01

PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differencesmore » between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.« less
Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM

NASA Astrophysics Data System (ADS)

de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl

2002-03-01

We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 105-106) in reasonable CPU times. The performances of our implementation of the pa! rallel code on an Origin 2000 supercomputer are presented and discussed. They exhibit very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Sayan Ghosh, Jeff Hammond

OpenSHMEM is a community effort to unifyt and standardize the SHMEM programming model. MPI (Message Passing Interface) is a well-known community standard for parallel programming using distributed memory. The most recen t release of MPI, version 3.0, was designed in part to support programming models like SHMEM.OSHMPI is an implementation of the OpenSHMEM standard using MPI-3 for the Linux operating system. It is the first implementation of SHMEM over MPI one-sided communication and has the potential to be widely adopted due to the portability and widely availability of Linux and MPI-3. OSHMPI has been tested on a variety of systemsmore » and implementations of MPI-3, includingInfiniBand clusters using MVAPICH2 and SGI shared-memory supercomputers using MPICH. Current support is limited to Linux but may be extended to Apple OSX if there is sufficient interest. The code is opensource via https://github.com/jeffhammond/oshmpi« less

Distributed computing for membrane-based modeling of action potential propagation.

PubMed

Porras, D; Rogers, J M; Smith, W M; Pollard, A E

2000-08-01

Action potential propagation simulations with physiologic membrane currents and macroscopic tissue dimensions are computationally expensive. We, therefore, analyzed distributed computing schemes to reduce execution time in workstation clusters by parallelizing solutions with message passing. Four schemes were considered in two-dimensional monodomain simulations with the Beeler-Reuter membrane equations. Parallel speedups measured with each scheme were compared to theoretical speedups, recognizing the relationship between speedup and code portions that executed serially. A data decomposition scheme based on total ionic current provided the best performance. Analysis of communication latencies in that scheme led to a load-balancing algorithm in which measured speedups at 89 +/- 2% and 75 +/- 8% of theoretical speedups were achieved in homogeneous and heterogeneous clusters of workstations. Speedups in this scheme with the Luo-Rudy dynamic membrane equations exceeded 3.0 with eight distributed workstations. Cluster speedups were comparable to those measured during parallel execution on a shared memory machine.
User-Defined Data Distributions in High-Level Programming Languages

NASA Technical Reports Server (NTRS)

Diaconescu, Roxana E.; Zima, Hans P.

2006-01-01

One of the characteristic features of today s high performance computing systems is a physically distributed memory. Efficient management of locality is essential for meeting key performance requirements for these architectures. The standard technique for dealing with this issue has involved the extension of traditional sequential programming languages with explicit message passing, in the context of a processor-centric view of parallel computation. This has resulted in complex and error-prone assembly-style codes in which algorithms and communication are inextricably interwoven. This paper presents a high-level approach to the design and implementation of data distributions. Our work is motivated by the need to improve the current parallel programming methodology by introducing a paradigm supporting the development of efficient and reusable parallel code. This approach is currently being implemented in the context of a new programming language called Chapel, which is designed in the HPCS project Cascade.
Nonlinear Fluid Computations in a Distributed Environment

NASA Technical Reports Server (NTRS)

Atwood, Christopher A.; Smith, Merritt H.

1995-01-01

The performance of a loosely and tightly-coupled workstation cluster is compared against a conventional vector supercomputer for the solution the Reynolds- averaged Navier-Stokes equations. The application geometries include a transonic airfoil, a tiltrotor wing/fuselage, and a wing/body/empennage/nacelle transport. Decomposition is of the manager-worker type, with solution of one grid zone per worker process coupled using the PVM message passing library. Task allocation is determined by grid size and processor speed, subject to available memory penalties. Each fluid zone is computed using an implicit diagonal scheme in an overset mesh framework, while relative body motion is accomplished using an additional worker process to re-establish grid communication.
Software Mechanisms for Multiprocessor TLB Consistency

DTIC Science & Technology

1989-12-01

thanks. Raj V aswani implemented the DASH message-passing system. Ramesh Govindan implemented part of the DASH virtual memory system. G . Scott...1 Latency (ma) 5𔃺 ---·-r··-·r···r··-T·-·-r··-·r····r····-r··-·r··- . g ~~~R=r 18 -----1·--·-r··-T·----1·--··r·---1·----1·--··r·---T·----1 16...model development. Synchronizing TLBs is similar to updating replicated data in a distributed environment. Lee and Garcia-Molina both used an M/ G /1
Porting Gravitational Wave Signal Extraction to Parallel Virtual Machine (PVM)

NASA Technical Reports Server (NTRS)

Thirumalainambi, Rajkumar; Thompson, David E.; Redmon, Jeffery

2009-01-01

Laser Interferometer Space Antenna (LISA) is a planned NASA-ESA mission to be launched around 2012. The Gravitational Wave detection is fundamentally the determination of frequency, source parameters, and waveform amplitude derived in a specific order from the interferometric time-series of the rotating LISA spacecrafts. The LISA Science Team has developed a Mock LISA Data Challenge intended to promote the testing of complicated nested search algorithms to detect the 100-1 millihertz frequency signals at amplitudes of 10E-21. However, it has become clear that, sequential search of the parameters is very time consuming and ultra-sensitive; hence, a new strategy has been developed. Parallelization of existing sequential search algorithms of Gravitational Wave signal identification consists of decomposing sequential search loops, beginning with outermost loops and working inward. In this process, the main challenge is to detect interdependencies among loops and partitioning the loops so as to preserve concurrency. Existing parallel programs are based upon either shared memory or distributed memory paradigms. In PVM, master and node programs are used to execute parallelization and process spawning. The PVM can handle process management and process addressing schemes using a virtual machine configuration. The task scheduling and the messaging and signaling can be implemented efficiently for the LISA Gravitational Wave search process using a master and 6 nodes. This approach is accomplished using a server that is available at NASA Ames Research Center, and has been dedicated to the LISA Data Challenge Competition. Historically, gravitational wave and source identification parameters have taken around 7 days in this dedicated single thread Linux based server. Using PVM approach, the parameter extraction problem can be reduced to within a day. The low frequency computation and a proxy signal-to-noise ratio are calculated in separate nodes that are controlled by the master using message and vector of data passing. The message passing among nodes follows a pattern of synchronous and asynchronous send-and-receive protocols. The communication model and the message buffers are allocated dynamically to address rapid search of gravitational wave source information in the Mock LISA data sets.
A parallel Monte Carlo code for planar and SPECT imaging: implementation, verification and applications in (131)I SPECT.

PubMed

Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F

2002-02-01

This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
The implementation of an aeronautical CFD flow code onto distributed memory parallel systems

NASA Astrophysics Data System (ADS)

Ierotheou, C. S.; Forsey, C. R.; Leatham, M.

2000-04-01

The parallelization of an industrially important in-house computational fluid dynamics (CFD) code for calculating the airflow over complex aircraft configurations using the Euler or Navier-Stokes equations is presented. The code discussed is the flow solver module of the SAUNA CFD suite. This suite uses a novel grid system that may include block-structured hexahedral or pyramidal grids, unstructured tetrahedral grids or a hybrid combination of both. To assist in the rapid convergence to a solution, a number of convergence acceleration techniques are employed including implicit residual smoothing and a multigrid full approximation storage scheme (FAS). Key features of the parallelization approach are the use of domain decomposition and encapsulated message passing to enable the execution in parallel using a single programme multiple data (SPMD) paradigm. In the case where a hybrid grid is used, a unified grid partitioning scheme is employed to define the decomposition of the mesh. The parallel code has been tested using both structured and hybrid grids on a number of different distributed memory parallel systems and is now routinely used to perform industrial scale aeronautical simulations. Copyright
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Statistics of Epidemics in Networks by Passing Messages

NASA Astrophysics Data System (ADS)

Shrestha, Munik Kumar

Epidemic processes are common out-of-equilibrium phenomena of broad interdisciplinary interest. In this thesis, we show how message-passing approach can be a helpful tool for simulating epidemic models in disordered medium like networks, and in particular for estimating the probability that a given node will become infectious at a particular time. The sort of dynamics we consider are stochastic, where randomness can arise from the stochastic events or from the randomness of network structures. As in belief propagation, variables or messages in message-passing approach are defined on the directed edges of a network. However, unlike belief propagation, where the posterior distributions are updated according to Bayes' rule, in message-passing approach we write differential equations for the messages over time. It takes correlations between neighboring nodes into account while preventing causal signals from backtracking to their immediate source, and thus avoids "echo chamber effects" where a pair of adjacent nodes each amplify the probability that the other is infectious. In our first results, we develop a message-passing approach to threshold models of behavior popular in sociology. These are models, first proposed by Granovetter, where individuals have to hear about a trend or behavior from some number of neighbors before adopting it themselves. In thermodynamic limit of large random networks, we provide an exact analytic scheme while calculating the time dependence of the probabilities and thus learning about the whole dynamics of bootstrap percolation, which is a simple model known in statistical physics for exhibiting discontinuous phase transition. As an application, we apply a similar model to financial networks, studying when bankruptcies spread due to the sudden devaluation of shared assets in overlapping portfolios. We predict that although diversification may be good for individual institutions, it can create dangerous systemic effects, and as a result financial contagion gets worse with too much diversification. We also predict that financial system exhibits "robust yet fragile" behavior, with regions of the parameter space where contagion is rare but catastrophic whenever it occurs. In further results, we develop a message-passing approach to recurrent state epidemics like susceptible-infectious-susceptible and susceptible-infectious-recovered-susceptible where nodes can return to previously inhabited states and multiple waves of infection can pass through the population. Given that message-passing has been applied exclusively to models with one-way state changes like susceptible-infectious and susceptible-infectious-recovered, we develop message-passing for recurrent epidemics based on a new class of differential equations and demonstrate that our approach is simple and efficiently approximates results obtained from Monte Carlo simulation, and that the accuracy of message-passing is often superior to the pair approximation (which also takes second-order correlations into account).
Tuning collective communication for Partitioned Global Address Space programming models

DOE PAGES

Nishtala, Rajesh; Zheng, Yili; Hargrove, Paul H.; ...

2011-06-12

Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this study we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and implementation. In particular, PGAS collectives have semantic issues thatmore » are different than in send–receive style message passing programs, and different implementation approaches that take advantage of the one-sided communication style in these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. In conclusion, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and demonstrate that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with InfiniBand interconnect.« less
GenASiS Basics: Object-oriented utilitarian functionality for large-scale physics simulations

DOE PAGES

Cardall, Christian Y.; Budiardja, Reuben D.

2015-06-11

Aside from numerical algorithms and problem setup, large-scale physics simulations on distributed-memory supercomputers require more basic utilitarian functionality, such as physical units and constants; display to the screen or standard output device; message passing; I/O to disk; and runtime parameter management and usage statistics. Here we describe and make available Fortran 2003 classes furnishing extensible object-oriented implementations of this sort of rudimentary functionality, along with individual `unit test' programs and larger example problems demonstrating their use. Lastly, these classes compose the Basics division of our developing astrophysics simulation code GenASiS (General Astrophysical Simulation System), but their fundamental nature makes themmore » useful for physics simulations in many fields.« less
Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

NASA Technical Reports Server (NTRS)

Quealy, Angela; Cole, Gary L.; Blech, Richard A.

1993-01-01

The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.
National Combustion Code Parallel Performance Enhancements

NASA Technical Reports Server (NTRS)

Quealy, Angela; Benyo, Theresa (Technical Monitor)

2002-01-01

The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. The unstructured grid, reacting flow code uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC code to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This report describes recent parallel processing modifications to NCC that have improved the parallel scalability of the code, enabling a two hour turnaround for a 1.3 million element fully reacting combustion simulation on an SGI Origin 2000.
A Multiple Sphere T-Matrix Fortran Code for Use on Parallel Computer Clusters

NASA Technical Reports Server (NTRS)

Mackowski, D. W.; Mishchenko, M. I.

2011-01-01

A general-purpose Fortran-90 code for calculation of the electromagnetic scattering and absorption properties of multiple sphere clusters is described. The code can calculate the efficiency factors and scattering matrix elements of the cluster for either fixed or random orientation with respect to the incident beam and for plane wave or localized- approximation Gaussian incident fields. In addition, the code can calculate maps of the electric field both interior and exterior to the spheres.The code is written with message passing interface instructions to enable the use on distributed memory compute clusters, and for such platforms the code can make feasible the calculation of absorption, scattering, and general EM characteristics of systems containing several thousand spheres.
Networking and AI systems: Requirements and benefits

NASA Technical Reports Server (NTRS)

1988-01-01

The price performance benefits of network systems is well documented. The ability to share expensive resources sold timesharing for mainframes, department clusters of minicomputers, and now local area networks of workstations and servers. In the process, other fundamental system requirements emerged. These have now been generalized with open system requirements for hardware, software, applications and tools. The ability to interconnect a variety of vendor products has led to a specification of interfaces that allow new techniques to extend existing systems for new and exciting applications. As an example of the message passing system, local area networks provide a testbed for many of the issues addressed by future concurrent architectures: synchronization, load balancing, fault tolerance and scalability. Gold Hill has been working with a number of vendors on distributed architectures that range from a network of workstations to a hypercube of microprocessors with distributed memory. Results from early applications are promising both for performance and scalability.
Distributed Processor/Memory Architectures Design Program

DTIC Science & Technology

1975-02-01

Event Scheduling Plo 31 Globat LAl Message Input Event Sicheduling Fhou ..... ............... 106 32 It tc Iata Representation...298 138 GEX LEX Scheduling Phlmophy ....... ...................... 300 139 Executive Comirol Herarchy... Scheduler Subroutine lnterrelatiomhips . ..... ................. 312 145 Task Scheduler Message Scatuer. . ...... ....................... 315 146
Twin Neurons for Efficient Real-World Data Distribution in Networks of Neural Cliques: Applications in Power Management in Electronic Circuits.

PubMed

Boguslawski, Bartosz; Gripon, Vincent; Seguin, Fabrice; Heitzmann, Frédéric

2016-02-01

Associative memories are data structures that allow retrieval of previously stored messages given part of their content. They, thus, behave similarly to the human brain's memory that is capable, for instance, of retrieving the end of a song, given its beginning. Among different families of associative memories, sparse ones are known to provide the best efficiency (ratio of the number of bits stored to that of the bits used). Recently, a new family of sparse associative memories achieving almost optimal efficiency has been proposed. Their structure, relying on binary connections and neurons, induces a direct mapping between input messages and stored patterns. Nevertheless, it is well known that nonuniformity of the stored messages can lead to a dramatic decrease in performance. In this paper, we show the impact of nonuniformity on the performance of this recent model, and we exploit the structure of the model to improve its performance in practical applications, where data are not necessarily uniform. In order to approach the performance of networks with uniformly distributed messages presented in theoretical studies, twin neurons are introduced. To assess the adapted model, twin neurons are used with the real-world data to optimize power consumption of electronic circuits in practical test cases.
Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4

DOE Office of Scientific and Technical Information (OSTI.GOV)

Computational Research Division, Lawrence Berkeley National Laboratory; NERSC, Lawrence Berkeley National Laboratory; Computer Science Department, University of California, Berkeley

2009-05-04

We apply auto-tuning to a hybrid MPI-pthreads lattice Boltzmann computation running on the Cray XT4 at National Energy Research Scientific Computing Center (NERSC). Previous work showed that multicore-specific auto-tuning can improve the performance of lattice Boltzmann magnetohydrodynamics (LBMHD) by a factor of 4x when running on dual- and quad-core Opteron dual-socket SMPs. We extend these studies to the distributed memory arena via a hybrid MPI/pthreads implementation. In addition to conventional auto-tuning at the local SMP node, we tune at the message-passing level to determine the optimal aspect ratio as well as the correct balance between MPI tasks and threads permore » MPI task. Our study presents a detailed performance analysis when moving along an isocurve of constant hardware usage: fixed total memory, total cores, and total nodes. Overall, our work points to approaches for improving intra- and inter-node efficiency on large-scale multicore systems for demanding scientific applications.« less
Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

NASA Technical Reports Server (NTRS)

Lawson, Gary; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

2017-01-01

In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 11 was measured for MPI+OpenMP.

Principles for problem aggregation and assignment in medium scale multiprocessors

NASA Technical Reports Server (NTRS)

Nicol, David M.; Saltz, Joel H.

1987-01-01

One of the most important issues in parallel processing is the mapping of workload to processors. This paper considers a large class of problems having a high degree of potential fine grained parallelism, and execution requirements that are either not predictable, or are too costly to predict. The main issues in mapping such a problem onto medium scale multiprocessors are those of aggregation and assignment. We study a method of parameterized aggregation that makes few assumptions about the workload. The mapping of aggregate units of work onto processors is uniform, and exploits locality of workload intensity to balance the unknown workload. In general, a finer aggregate granularity leads to a better balance at the price of increased communication/synchronization costs; the aggregation parameters can be adjusted to find a reasonable granularity. The effectiveness of this scheme is demonstrated on three model problems: an adaptive one-dimensional fluid dynamics problem with message passing, a sparse triangular linear system solver on both a shared memory and a message-passing machine, and a two-dimensional time-driven battlefield simulation employing message passing. Using the model problems, the tradeoffs are studied between balanced workload and the communication/synchronization costs. Finally, an analytical model is used to explain why the method balances workload and minimizes the variance in system behavior.
Run-time scheduling and execution of loops on message passing machines

NASA Technical Reports Server (NTRS)

Crowley, Kay; Saltz, Joel; Mirchandaney, Ravi; Berryman, Harry

1989-01-01

Sparse system solvers and general purpose codes for solving partial differential equations are examples of the many types of problems whose irregularity can result in poor performance on distributed memory machines. Often, the data structures used in these problems are very flexible. Crucial details concerning loop dependences are encoded in these structures rather than being explicitly represented in the program. Good methods for parallelizing and partitioning these types of problems require assignment of computations in rather arbitrary ways. Naive implementations of programs on distributed memory machines requiring general loop partitions can be extremely inefficient. Instead, the scheduling mechanism needs to capture the data reference patterns of the loops in order to partition the problem. First, the indices assigned to each processor must be locally numbered. Next, it is necessary to precompute what information is needed by each processor at various points in the computation. The precomputed information is then used to generate an execution template designed to carry out the computation, communication, and partitioning of data, in an optimized manner. The design is presented for a general preprocessor and schedule executer, the structures of which do not vary, even though the details of the computation and of the type of information are problem dependent.
Run-time scheduling and execution of loops on message passing machines

NASA Technical Reports Server (NTRS)

Saltz, Joel; Crowley, Kathleen; Mirchandaney, Ravi; Berryman, Harry

1990-01-01

Sparse system solvers and general purpose codes for solving partial differential equations are examples of the many types of problems whose irregularity can result in poor performance on distributed memory machines. Often, the data structures used in these problems are very flexible. Crucial details concerning loop dependences are encoded in these structures rather than being explicitly represented in the program. Good methods for parallelizing and partitioning these types of problems require assignment of computations in rather arbitrary ways. Naive implementations of programs on distributed memory machines requiring general loop partitions can be extremely inefficient. Instead, the scheduling mechanism needs to capture the data reference patterns of the loops in order to partition the problem. First, the indices assigned to each processor must be locally numbered. Next, it is necessary to precompute what information is needed by each processor at various points in the computation. The precomputed information is then used to generate an execution template designed to carry out the computation, communication, and partitioning of data, in an optimized manner. The design is presented for a general preprocessor and schedule executer, the structures of which do not vary, even though the details of the computation and of the type of information are problem dependent.
Design considerations for parallel graphics libraries

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.

1994-01-01

Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.
Matrix multiplication on the Intel Touchstone Delta

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huss-Lederman, S.; Jacobson, E.M.; Tsao, A.

1993-12-31

Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message-passing architecture with a two-dimensional mesh topology. We obtain an implementation that uses communication primitives highly suited to the Delta and exploits the single node assembly-coded matrix multiplication. Our algorithm is completely general, able to deal with arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86% with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800more » {times} 8800 matrix. We describe our algorithm design and implementation, and present performance results that demonstrate scalability and robust behavior over varying mesh topologies.« less
NavP: Structured and Multithreaded Distributed Parallel Programming

NASA Technical Reports Server (NTRS)

Pan, Lei

2007-01-01

We present Navigational Programming (NavP) -- a distributed parallel programming methodology based on the principles of migrating computations and multithreading. The four major steps of NavP are: (1) Distribute the data using the data communication pattern in a given algorithm; (2) Insert navigational commands for the computation to migrate and follow large-sized distributed data; (3) Cut the sequential migrating thread and construct a mobile pipeline; and (4) Loop back for refinement. NavP is significantly different from the current prevailing Message Passing (MP) approach. The advantages of NavP include: (1) NavP is structured distributed programming and it does not change the code structure of an original algorithm. This is in sharp contrast to MP as MP implementations in general do not resemble the original sequential code; (2) NavP implementations are always competitive with the best MPI implementations in terms of performance. Approaches such as DSM or HPF have failed to deliver satisfying performance as of today in contrast, even if they are relatively easy to use compared to MP; (3) NavP provides incremental parallelization, which is beyond the reach of MP; and (4) NavP is a unifying approach that allows us to exploit both fine- (multithreading on shared memory) and coarse- (pipelined tasks on distributed memory) grained parallelism. This is in contrast to the currently popular hybrid use of MP+OpenMP, which is known to be complex to use. We present experimental results that demonstrate the effectiveness of NavP.
A manual for PARTI runtime primitives

NASA Technical Reports Server (NTRS)

Berryman, Harry; Saltz, Joel

1990-01-01

Primitives are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equations solvers) on distributed memory machines. These primitives are also designed for use in compilers for distributed memory multiprocessors. Communications patterns are captured at runtime, and the appropriate send and receive messages are automatically generated.
Extensions to the Parallel Real-Time Artificial Intelligence System (PRAIS) for fault-tolerant heterogeneous cycle-stealing reasoning

NASA Technical Reports Server (NTRS)

Goldstein, David

1991-01-01

Extensions to an architecture for real-time, distributed (parallel) knowledge-based systems called the Parallel Real-time Artificial Intelligence System (PRAIS) are discussed. PRAIS strives for transparently parallelizing production (rule-based) systems, even under real-time constraints. PRAIS accomplished these goals (presented at the first annual C Language Integrated Production System (CLIPS) conference) by incorporating a dynamic task scheduler, operating system extensions for fact handling, and message-passing among multiple copies of CLIPS executing on a virtual blackboard. This distributed knowledge-based system tool uses the portability of CLIPS and common message-passing protocols to operate over a heterogeneous network of processors. Results using the original PRAIS architecture over a network of Sun 3's, Sun 4's and VAX's are presented. Mechanisms using the producer-consumer model to extend the architecture for fault-tolerance and distributed truth maintenance initiation are also discussed.
Interprocess Communication in Highly Distributed Systems - A Workshop Report - 20 to 22, November 1978.

DTIC Science & Technology

1979-12-01

Links between processes can be aLlocated strictLy to controL functions. In fact, the degree of separation of control and data is an important research is...delays or Loss of control messages. Cognoscienti agree that message-passing IPC schemes are equivalent in "power" to schemes which employ shared...THEORETICAL WORK Page 55 SECTION 6 THEORETICAL WORK 6.1 WORKING GRUP JIM REP.OR STRUCTURE of Discussion: Distributed system without central (or any) control
Messaging to Increase Public Support for Naloxone Distribution Policies in the United States: Results from a Randomized Survey Experiment.

PubMed

Bachhuber, Marcus A; McGinty, Emma E; Kennedy-Hendricks, Alene; Niederdeppe, Jeff; Barry, Colleen L

2015-01-01

Barriers to public support for naloxone distribution include lack of knowledge, concerns about potential unintended consequences, and lack of sympathy for people at risk of overdose. A randomized survey experiment was conducted with a nationally-representative web-based survey research panel (GfK KnowledgePanel). Participants were randomly assigned to read different messages alone or in combination: 1) factual information about naloxone; 2) pre-emptive refutation of potential concerns about naloxone distribution; and 3) a sympathetic narrative about a mother whose daughter died of an opioid overdose. Participants were then asked if they support or oppose policies related to naloxone distribution. For each policy item, logistic regression models were used to test the effect of each message exposure compared with the no-exposure control group. The final sample consisted of 1,598 participants (completion rate: 72.6%). Factual information and the sympathetic narrative alone each led to higher support for training first responders to use naloxone, providing naloxone to friends and family members of people using opioids, and passing laws to protect people who administer naloxone. Participants receiving the combination of the sympathetic narrative and factual information, compared to factual information alone, were more likely to support all policies: providing naloxone to friends and family members (OR: 2.0 [95% CI: 1.4 to 2.9]), training first responders to use naloxone (OR: 2.0 [95% CI: 1.2 to 3.4]), passing laws to protect people if they administer naloxone (OR: 1.5 [95% CI: 1.04 to 2.2]), and passing laws to protect people if they call for medical help for an overdose (OR: 1.7 [95% CI: 1.2 to 2.5]). All messages increased public support, but combining factual information and the sympathetic narrative was most effective. Public support for naloxone distribution can be improved through education and sympathetic portrayals of the population who stands to benefit from these policies.
Design of a network for concurrent message passing systems

NASA Astrophysics Data System (ADS)

Song, Paul Y.

1988-08-01

We describe the design of the network design frame (NDF), a self-timed routing chip for a message-passing concurrent computer. The NDF uses a partitioned data path, low-voltage output drivers, and a distributed token-passing arbiter to provide a bandwidth of 450 Mbits/sec into the network. Wormhole routing and bidirectional virtual channels are used to provide low latency communications, less than 2us latency to deliver a 216 bit message across the diameter of a 1K node mess-connected machine. To support concurrent software systems, the NDF provides two logical networks, one for user messages and one for system messages. The two networks share the same set of physical wires. To facilitate the development of network nodes, the NDF is a design frame. The NDF circuitry is integrated into the pad frame of a chip leaving the center of the chip uncommitted. We define an analytic framework in which to study the effects of network size, network buffering capacity, bidirectional channels, and traffic on this class of networks. The response of the network to various combinations of these parameters are obtained through extensive simulation of the network model. Through simulation, we are able to observe the macro behavior of the network as opposed to the micro behavior of the NDF routing controller.
A manual for PARTI runtime primitives, revision 1

NASA Technical Reports Server (NTRS)

Das, Raja; Saltz, Joel; Berryman, Harry

1991-01-01

Primitives are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equations solvers) on distributed memory machines. These primitives are also designed for use in compilers for distributed memory multiprocessors. Communications patterns are captured at runtime, and the appropriate send and receive messages are automatically generated.
High-performance computing — an overview

NASA Astrophysics Data System (ADS)

Marksteiner, Peter

1996-08-01

An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
Control Transfer in Operating System Kernels

DTIC Science & Technology

1994-05-13

microkernel system that runs less code in the kernel address space. To realize the performance benefit of allocating stacks in unmapped kseg0 memory, the...review how I modified the Mach 3.0 kernel to use continuations. Because of Mach’s message-passing microkernel structure, interprocess communication was...critical control transfer paths, deeply- nested call chains are undesirable in any case because of the function call overhead. 4.1.3 Microkernel Operating
Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Heber, Gerd; Biswas, Rupak

2000-01-01

The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
A distributed infrastructure for publishing VO services: an implementation

NASA Astrophysics Data System (ADS)

Cepparo, Francesco; Scagnetto, Ivan; Molinaro, Marco; Smareglia, Riccardo

2016-07-01

This contribution describes both the design and the implementation details of a new solution for publishing VO services, enlightening its maintainable, distributed, modular and scalable architecture. Indeed, the new publisher is multithreaded and multiprocess. Multiple instances of the modules can run on different machines to ensure high performance and high availability, and this will be true both for the interface modules of the services and the back end data access ones. The system uses message passing to let its components communicate through an AMQP message broker that can itself be distributed to provide better scalability and availability.
Intel NX to PVM 3.2 message passing conversion library

NASA Technical Reports Server (NTRS)

Arthur, Trey; Nelson, Michael L.

1993-01-01

NASA Langley Research Center has developed a library that allows Intel NX message passing codes to be executed under the more popular and widely supported Parallel Virtual Machine (PVM) message passing library. PVM was developed at Oak Ridge National Labs and has become the defacto standard for message passing. This library will allow the many programs that were developed on the Intel iPSC/860 or Intel Paragon in a Single Program Multiple Data (SPMD) design to be ported to the numerous architectures that PVM (version 3.2) supports. Also, the library adds global operations capability to PVM. A familiarity with Intel NX and PVM message passing is assumed.
Aerodynamic Shape Optimization of Supersonic Aircraft Configurations via an Adjoint Formulation on Parallel Computers

NASA Technical Reports Server (NTRS)

Reuther, James; Alonso, Juan Jose; Rimlinger, Mark J.; Jameson, Antony

1996-01-01

This work describes the application of a control theory-based aerodynamic shape optimization method to the problem of supersonic aircraft design. The design process is greatly accelerated through the use of both control theory and a parallel implementation on distributed memory computers. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods. The resulting problem is then implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) Standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on higher order computational fluid dynamics methods (CFD). In our earlier studies, the serial implementation of this design method was shown to be effective for the optimization of airfoils, wings, wing-bodies, and complex aircraft configurations using both the potential equation and the Euler equations. In our most recent paper, the Euler method was extended to treat complete aircraft configurations via a new multiblock implementation. Furthermore, during the same conference, we also presented preliminary results demonstrating that this basic methodology could be ported to distributed memory parallel computing architectures. In this paper, our concern will be to demonstrate that the combined power of these new technologies can be used routinely in an industrial design environment by applying it to the case study of the design of typical supersonic transport configurations. A particular difficulty of this test case is posed by the propulsion/airframe integration.
Distributed Memory Parallel Computing with SEAWAT

NASA Astrophysics Data System (ADS)

Verkaik, J.; Huizer, S.; van Engelen, J.; Oude Essink, G.; Ram, R.; Vuik, K.

2017-12-01

Fresh groundwater reserves in coastal aquifers are threatened by sea-level rise, extreme weather conditions, increasing urbanization and associated groundwater extraction rates. To counteract these threats, accurate high-resolution numerical models are required to optimize the management of these precious reserves. The major model drawbacks are long run times and large memory requirements, limiting the predictive power of these models. Distributed memory parallel computing is an efficient technique for reducing run times and memory requirements, where the problem is divided over multiple processor cores. A new Parallel Krylov Solver (PKS) for SEAWAT is presented. PKS has recently been applied to MODFLOW and includes Conjugate Gradient (CG) and Biconjugate Gradient Stabilized (BiCGSTAB) linear accelerators. Both accelerators are preconditioned by an overlapping additive Schwarz preconditioner in a way that: a) subdomains are partitioned using Recursive Coordinate Bisection (RCB) load balancing, b) each subdomain uses local memory only and communicates with other subdomains by Message Passing Interface (MPI) within the linear accelerator, c) it is fully integrated in SEAWAT. Within SEAWAT, the PKS-CG solver replaces the Preconditioned Conjugate Gradient (PCG) solver for solving the variable-density groundwater flow equation and the PKS-BiCGSTAB solver replaces the Generalized Conjugate Gradient (GCG) solver for solving the advection-diffusion equation. PKS supports the third-order Total Variation Diminishing (TVD) scheme for computing advection. Benchmarks were performed on the Dutch national supercomputer (https://userinfo.surfsara.nl/systems/cartesius) using up to 128 cores, for a synthetic 3D Henry model (100 million cells) and the real-life Sand Engine model ( 10 million cells). The Sand Engine model was used to investigate the potential effect of the long-term morphological evolution of a large sand replenishment and climate change on fresh groundwater resources. Speed-ups up to 40 were obtained with the new PKS solver.
An Efficient Multiblock Method for Aerodynamic Analysis and Design on Distributed Memory Systems

NASA Technical Reports Server (NTRS)

Reuther, James; Alonso, Juan Jose; Vassberg, John C.; Jameson, Antony; Martinelli, Luigi

1997-01-01

The work presented in this paper describes the application of a multiblock gridding strategy to the solution of aerodynamic design optimization problems involving complex configurations. The design process is parallelized using the MPI (Message Passing Interface) Standard such that it can be efficiently run on a variety of distributed memory systems ranging from traditional parallel computers to networks of workstations. Substantial improvements to the parallel performance of the baseline method are presented, with particular attention to their impact on the scalability of the program as a function of the mesh size. Drag minimization calculations at a fixed coefficient of lift are presented for a business jet configuration that includes the wing, body, pylon, aft-mounted nacelle, and vertical and horizontal tails. An aerodynamic design optimization is performed with both the Euler and Reynolds Averaged Navier-Stokes (RANS) equations governing the flow solution and the results are compared. These sample calculations establish the feasibility of efficient aerodynamic optimization of complete aircraft configurations using the RANS equations as the flow model. There still exists, however, the need for detailed studies of the importance of a true viscous adjoint method which holds the promise of tackling the minimization of not only the wave and induced components of drag, but also the viscous drag.

Theoretic derivation of directed acyclic subgraph algorithm and comparisons with message passing algorithm

NASA Astrophysics Data System (ADS)

Ha, Jeongmok; Jeong, Hong

2016-07-01

This study investigates the directed acyclic subgraph (DAS) algorithm, which is used to solve discrete labeling problems much more rapidly than other Markov-random-field-based inference methods but at a competitive accuracy. However, the mechanism by which the DAS algorithm simultaneously achieves competitive accuracy and fast execution speed, has not been elucidated by a theoretical derivation. We analyze the DAS algorithm by comparing it with a message passing algorithm. Graphical models, inference methods, and energy-minimization frameworks are compared between DAS and message passing algorithms. Moreover, the performances of DAS and other message passing methods [sum-product belief propagation (BP), max-product BP, and tree-reweighted message passing] are experimentally compared.
A Parallel Workload Model and its Implications for Processor Allocation

DTIC Science & Technology

1996-11-01

with SEV or AVG, both of which can tolerate c = 0.4 { 0.6 before their performance deteriorates signi cantly. On the other hand, Setia [10] has...Sanjeev. K Setia . The interaction between memory allocation and adaptive partitioning in message-passing multicomputers. In IPPS 󈨣 Workshop on Job...Scheduling Strategies for Parallel Processing, pages 89{99, 1995. [11] Sanjeev K. Setia and Satish K. Tripathi. An analysis of several processor
A Survey of Rollback-Recovery Protocols in Message-Passing Systems

DTIC Science & Technology

1999-06-01

and M.A. Castillo. "Checkpointing through garbage collection." Technical report. Departamento de Ciencia de la Computation, Escuela de Ingenieria ...between consecutive checkpoints. It can be implemented by using the dirty-bit of the memory protection hardware or by emulating a dirty-bit in software [4...compare the program’s state with the previous checkpoint in software , and writing the difference in a new checkpoint [46]. The required storage and
An HLA-Based Approach to Quantify Achievable Performance for Tactical Edge Applications

DTIC Science & Technology

2011-05-01

in: Proceedings of the 2002 Fall Simulation Interoperability Workshop, 02F- SIW -068, Nov 2002. [16] P. Knight, et al. ―WBT RTI Independent...Benchmark Tests: Design, Implementation, and Updated Results‖, in: Proceedings of the 2002 Spring Simulation Interoperability Workshop, 02S- SIW -081, March...Interoperability Workshop, 98F- SIW -085, Nov 1998. [18] S. Ferenci and R. Fujimoto. ―RTI Performance on Shared Memory and Message Passing Architectures‖, in
Internode data communications in a parallel computer

DOEpatents

Archer, Charles J.; Blocksome, Michael A.; Miller, Douglas R.; Parker, Jeffrey J.; Ratterman, Joseph D.; Smith, Brian E.

2013-09-03

Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
Internode data communications in a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E

2014-02-11

Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
Performance issues for domain-oriented time-driven distributed simulations

NASA Technical Reports Server (NTRS)

Nicol, David M.

1987-01-01

It has long been recognized that simulations form an interesting and important class of computations that may benefit from distributed or parallel processing. Since the point of parallel processing is improved performance, the recent proliferation of multiprocessors requires that we consider the performance issues that naturally arise when attempting to implement a distributed simulation. Three such issues are: (1) the problem of mapping the simulation onto the architecture, (2) the possibilities for performing redundant computation in order to reduce communication, and (3) the avoidance of deadlock due to distributed contention for message-buffer space. These issues are discussed in the context of a battlefield simulation implemented on a medium-scale multiprocessor message-passing architecture.
Efficient Data Generation and Publication as a Test Tool

NASA Technical Reports Server (NTRS)

Einstein, Craig Jakob

2017-01-01

A tool to facilitate the generation and publication of test data was created to test the individual components of a command and control system designed to launch spacecraft. Specifically, this tool was built to ensure messages are properly passed between system components. The tool can also be used to test whether the appropriate groups have access (read/write privileges) to the correct messages. The messages passed between system components take the form of unique identifiers with associated values. These identifiers are alphanumeric strings that identify the type of message and the additional parameters that are contained within the message. The values that are passed with the message depend on the identifier. The data generation tool allows for the efficient creation and publication of these messages. A configuration file can be used to set the parameters of the tool and also specify which messages to pass.
Accelerating atomistic calculations of quantum energy eigenstates on graphic cards

NASA Astrophysics Data System (ADS)

Rodrigues, Walter; Pecchia, A.; Lopez, M.; Auf der Maur, M.; Di Carlo, A.

2014-10-01

Electronic properties of nanoscale materials require the calculation of eigenvalues and eigenvectors of large matrices. This bottleneck can be overcome by parallel computing techniques or the introduction of faster algorithms. In this paper we report a custom implementation of the Lanczos algorithm with simple restart, optimized for graphical processing units (GPUs). The whole algorithm has been developed using CUDA and runs entirely on the GPU, with a specialized implementation that spares memory and reduces at most machine-to-device data transfers. Furthermore parallel distribution over several GPUs has been attained using the standard message passing interface (MPI). Benchmark calculations performed on a GaN/AlGaN wurtzite quantum dot with up to 600,000 atoms are presented. The empirical tight-binding (ETB) model with an sp3d5s∗+spin-orbit parametrization has been used to build the system Hamiltonian (H).
Paradigms and strategies for scientific computing on distributed memory concurrent computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Foster, I.T.; Walker, D.W.

1994-06-01

In this work we examine recent advances in parallel languages and abstractions that have the potential for improving the programmability and maintainability of large-scale, parallel, scientific applications running on high performance architectures and networks. This paper focuses on Fortran M, a set of extensions to Fortran 77 that supports the modular design of message-passing programs. We describe the Fortran M implementation of a particle-in-cell (PIC) plasma simulation application, and discuss issues in the optimization of the code. The use of two other methodologies for parallelizing the PIC application are considered. The first is based on the shared object abstraction asmore » embodied in the Orca language. The second approach is the Split-C language. In Fortran M, Orca, and Split-C the ability of the programmer to control the granularity of communication is important is designing an efficient implementation.« less
Multicore Challenges and Benefits for High Performance Scientific Computing

DOE PAGES

Nielsen, Ida M. B.; Janssen, Curtis L.

2008-01-01

Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexitymore » of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.« less
Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver

NASA Technical Reports Server (NTRS)

Ajmani, Kumud; Taylor, Arthur C., III

1994-01-01

This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.
National Combustion Code: Parallel Implementation and Performance

NASA Technical Reports Server (NTRS)

Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.

2000-01-01

The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.
Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver

NASA Technical Reports Server (NTRS)

Mavriplis, Dimitri J.

1998-01-01

A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.
Strategies for Energy Efficient Resource Management of Hybrid Programming Models

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Dong; Supinski, Bronis de; Schulz, Martin

2013-01-01

Many scientific applications are programmed using hybrid programming models that use both message-passing and shared-memory, due to the increasing prevalence of large-scale systems with multicore, multisocket nodes. Previous work has shown that energy efficiency can be improved using software-controlled execution schemes that consider both the programming model and the power-aware execution capabilities of the system. However, such approaches have focused on identifying optimal resource utilization for one programming model, either shared-memory or message-passing, in isolation. The potential solution space, thus the challenge, increases substantially when optimizing hybrid models since the possible resource configurations increase exponentially. Nonetheless, with the accelerating adoptionmore » of hybrid programming models, we increasingly need improved energy efficiency in hybrid parallel applications on large-scale systems. In this work, we present new software-controlled execution schemes that consider the effects of dynamic concurrency throttling (DCT) and dynamic voltage and frequency scaling (DVFS) in the context of hybrid programming models. Specifically, we present predictive models and novel algorithms based on statistical analysis that anticipate application power and time requirements under different concurrency and frequency configurations. We apply our models and methods to the NPB MZ benchmarks and selected applications from the ASC Sequoia codes. Overall, we achieve substantial energy savings (8.74% on average and up to 13.8%) with some performance gain (up to 7.5%) or negligible performance loss.« less
Execution time supports for adaptive scientific algorithms on distributed memory machines

NASA Technical Reports Server (NTRS)

Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey

1990-01-01

Optimizations are considered that are required for efficient execution of code segments that consists of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communications patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
Execution time support for scientific programs on distributed memory machines

NASA Technical Reports Server (NTRS)

Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey

1990-01-01

Optimizations are considered that are required for efficient execution of code segments that consists of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communications patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
Evaluation Metrics for the Paragon XP/S-15

NASA Technical Reports Server (NTRS)

Traversat, Bernard; McNab, David; Nitzberg, Bill; Fineberg, Sam; Blaylock, Bruce T. (Technical Monitor)

1993-01-01

On February 17th 1993, the Numerical Aerodynamic Simulation (NAS) facility located at the NASA Ames Research Center installed a 224 node Intel Paragon XP/S-15 system. After its installation, the Paragon was found to be in a very immature state and was unable to support a NAS users' workload, composed of a wide range of development and production activities. As a first step towards addressing this problem, we implemented a set of metrics to objectively monitor the system as operating system and hardware upgrades were installed. The metrics were designed to measure four aspects of the system that we consider essential to support our workload: availability, utilization, functionality, and performance. This report presents the metrics collected from February 1993 to August 1993. Since its installation, the Paragon availability has improved from a low of 15% uptime to a high of 80%, while its utilization has remained low. Functionality and performance have improved from merely running one of the NAS Parallel Benchmarks to running all of them faster (between 1 and 2 times) than on the iPSC/860. In spite of the progress accomplished, fundamental limitations of the Paragon operating system are restricting the Paragon from supporting the NAS workload. The maximum operating system message passing (NORMA IPC) bandwidth was measured at 11 Mbytes/s, well below the peak hardware bandwidth (175 Mbytes/s), limiting overall virtual memory and Unix services (i.e. Disk and HiPPI I/O) performance. The high NX application message passing latency (184 microns), three times than on the iPSC/860, was found to significantly degrade performance of applications relying on small message sizes. The amount of memory available for an application was found to be approximately 10 Mbytes per node, indicating that the OS is taking more space than anticipated (6 Mbytes per node).
Aerodynamic Shape Optimization of Supersonic Aircraft Configurations via an Adjoint Formulation on Parallel Computers

NASA Technical Reports Server (NTRS)

Reuther, James; Alonso, Juan Jose; Rimlinger, Mark J.; Jameson, Antony

1996-01-01

This work describes the application of a control theory-based aerodynamic shape optimization method to the problem of supersonic aircraft design. The design process is greatly accelerated through the use of both control theory and a parallel implementation on distributed memory computers. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods (13, 12, 44, 38). The resulting problem is then implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) Standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on higher order computational fluid dynamics methods (CFD). In our earlier studies, the serial implementation of this design method (19, 20, 21, 23, 39, 25, 40, 41, 42, 43, 9) was shown to be effective for the optimization of airfoils, wings, wing-bodies, and complex aircraft configurations using both the potential equation and the Euler equations (39, 25). In our most recent paper, the Euler method was extended to treat complete aircraft configurations via a new multiblock implementation. Furthermore, during the same conference, we also presented preliminary results demonstrating that the basic methodology could be ported to distributed memory parallel computing architectures [241. In this paper, our concem will be to demonstrate that the combined power of these new technologies can be used routinely in an industrial design environment by applying it to the case study of the design of typical supersonic transport configurations. A particular difficulty of this test case is posed by the propulsion/airframe integration.
Multi-petascale highly efficient parallel supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and minimize latency. The network implements collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design include a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and multiversioning cache that improves soft error rate at the same time andmore » supports DMA functionality allowing for parallel processing message-passing.« less

Application of a distributed network in computational fluid dynamic simulations

NASA Technical Reports Server (NTRS)

Deshpande, Manish; Feng, Jinzhang; Merkle, Charles L.; Deshpande, Ashish

1994-01-01

A general-purpose 3-D, incompressible Navier-Stokes algorithm is implemented on a network of concurrently operating workstations using parallel virtual machine (PVM) and compared with its performance on a CRAY Y-MP and on an Intel iPSC/860. The problem is relatively computationally intensive, and has a communication structure based primarily on nearest-neighbor communication, making it ideally suited to message passing. Such problems are frequently encountered in computational fluid dynamics (CDF), and their solution is increasingly in demand. The communication structure is explicitly coded in the implementation to fully exploit the regularity in message passing in order to produce a near-optimal solution. Results are presented for various grid sizes using up to eight processors.
Gauge-free cluster variational method by maximal messages and moment matching.

PubMed

Domínguez, Eduardo; Lage-Castellanos, Alejandro; Mulet, Roberto; Ricci-Tersenghi, Federico

2017-04-01

We present an implementation of the cluster variational method (CVM) as a message passing algorithm. The kind of message passing algorithm used for CVM, usually named generalized belief propagation (GBP), is a generalization of the belief propagation algorithm in the same way that CVM is a generalization of the Bethe approximation for estimating the partition function. However, the connection between fixed points of GBP and the extremal points of the CVM free energy is usually not a one-to-one correspondence because of the existence of a gauge transformation involving the GBP messages. Our contribution is twofold. First, we propose a way of defining messages (fields) in a generic CVM approximation, such that messages arrive on a given region from all its ancestors, and not only from its direct parents, as in the standard parent-to-child GBP. We call this approach maximal messages. Second, we focus on the case of binary variables, reinterpreting the messages as fields enforcing the consistency between the moments of the local (marginal) probability distributions. We provide a precise rule to enforce all consistencies, avoiding any redundancy, that would otherwise lead to a gauge transformation on the messages. This moment matching method is gauge free, i.e., it guarantees that the resulting GBP is not gauge invariant. We apply our maximal messages and moment matching GBP to obtain an analytical expression for the critical temperature of the Ising model in general dimensions at the level of plaquette CVM. The values obtained outperform Bethe estimates, and are comparable with loop corrected belief propagation equations. The method allows for a straightforward generalization to disordered systems.
Gauge-free cluster variational method by maximal messages and moment matching

NASA Astrophysics Data System (ADS)

Domínguez, Eduardo; Lage-Castellanos, Alejandro; Mulet, Roberto; Ricci-Tersenghi, Federico

2017-04-01

We present an implementation of the cluster variational method (CVM) as a message passing algorithm. The kind of message passing algorithm used for CVM, usually named generalized belief propagation (GBP), is a generalization of the belief propagation algorithm in the same way that CVM is a generalization of the Bethe approximation for estimating the partition function. However, the connection between fixed points of GBP and the extremal points of the CVM free energy is usually not a one-to-one correspondence because of the existence of a gauge transformation involving the GBP messages. Our contribution is twofold. First, we propose a way of defining messages (fields) in a generic CVM approximation, such that messages arrive on a given region from all its ancestors, and not only from its direct parents, as in the standard parent-to-child GBP. We call this approach maximal messages. Second, we focus on the case of binary variables, reinterpreting the messages as fields enforcing the consistency between the moments of the local (marginal) probability distributions. We provide a precise rule to enforce all consistencies, avoiding any redundancy, that would otherwise lead to a gauge transformation on the messages. This moment matching method is gauge free, i.e., it guarantees that the resulting GBP is not gauge invariant. We apply our maximal messages and moment matching GBP to obtain an analytical expression for the critical temperature of the Ising model in general dimensions at the level of plaquette CVM. The values obtained outperform Bethe estimates, and are comparable with loop corrected belief propagation equations. The method allows for a straightforward generalization to disordered systems.
NAS Parallel Benchmarks. 2.4

NASA Technical Reports Server (NTRS)

VanderWijngaart, Rob; Biegel, Bryan A. (Technical Monitor)

2002-01-01

We describe a new problem size, called Class D, for the NAS Parallel Benchmarks (NPB), whose MPI source code implementation is being released as NPB 2.4. A brief rationale is given for how the new class is derived. We also describe the modifications made to the MPI (Message Passing Interface) implementation to allow the new class to be run on systems with 32-bit integers, and with moderate amounts of memory. Finally, we give the verification values for the new problem size.
8755 Emulator Design

DTIC Science & Technology

1988-12-01

Break Control....................51 8755 I/O Control..................54 Z-100 Control Software................. 55 Pass User Memory...the emulator SRAM and the other is 55 for the target SRAM. If either signal is replaced by the NACK signal the host computer displays an error message...Block Diagram 69 AUU M/M 8 in Fiur[:. cemti Daga 70m F’M I-- ’I ANLAD AMam U52 COMM ow U2 i M.2 " -n ax- U 6_- Figure 2b. Schematic Diagram 71 )-M -A -I
Parallel SOR methods with a parabolic-diffusion acceleration technique for solving an unstructured-grid Poisson equation on 3D arbitrary geometries

NASA Astrophysics Data System (ADS)

Zapata, M. A. Uh; Van Bang, D. Pham; Nguyen, K. D.

2016-05-01

This paper presents a parallel algorithm for the finite-volume discretisation of the Poisson equation on three-dimensional arbitrary geometries. The proposed method is formulated by using a 2D horizontal block domain decomposition and interprocessor data communication techniques with message passing interface. The horizontal unstructured-grid cells are reordered according to the neighbouring relations and decomposed into blocks using a load-balanced distribution to give all processors an equal amount of elements. In this algorithm, two parallel successive over-relaxation methods are presented: a multi-colour ordering technique for unstructured grids based on distributed memory and a block method using reordering index following similar ideas of the partitioning for structured grids. In all cases, the parallel algorithms are implemented with a combination of an acceleration iterative solver. This solver is based on a parabolic-diffusion equation introduced to obtain faster solutions of the linear systems arising from the discretisation. Numerical results are given to evaluate the performances of the methods showing speedups better than linear.
Parallel computation and the Basis system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Smith, G.R.

1992-12-16

A software package has been written that can facilitate efforts to develop powerful, flexible, and easy-to-use programs that can run in single-processor, massively parallel, and distributed computing environments. Particular attention has been given to the difficulties posed by a program consisting of many science packages that represent subsystems of a complicated, coupled system. Methods have been found to maintain independence of the packages by hiding data structures without increasing the communication costs in a parallel computing environment. Concepts developed in this work are demonstrated by a prototype program that uses library routines from two existing software systems, Basis and Parallelmore » Virtual Machine (PVM). Most of the details of these libraries have been encapsulated in routines and macros that could be rewritten for alternative libraries that possess certain minimum capabilities. The prototype software uses a flexible master-and-slaves paradigm for parallel computation and supports domain decomposition with message passing for partitioning work among slaves. Facilities are provided for accessing variables that are distributed among the memories of slaves assigned to subdomains. The software is named PROTOPAR.« less
Parallel computation and the basis system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Smith, G.R.

1993-05-01

A software package has been written that can facilitate efforts to develop powerful, flexible, and easy-to use programs that can run in single-processor, massively parallel, and distributed computing environments. Particular attention has been given to the difficulties posed by a program consisting of many science packages that represent subsystems of a complicated, coupled system. Methods have been found to maintain independence of the packages by hiding data structures without increasing the communications costs in a parallel computing environment. Concepts developed in this work are demonstrated by a prototype program that uses library routines from two existing software systems, Basis andmore » Parallel Virtual Machine (PVM). Most of the details of these libraries have been encapsulated in routines and macros that could be rewritten for alternative libraries that possess certain minimum capabilities. The prototype software uses a flexible master-and-slaves paradigm for parallel computation and supports domain decomposition with message passing for partitioning work among slaves. Facilities are provided for accessing variables that are distributed among the memories of slaves assigned to subdomains. The software is named PROTOPAR.« less
Parallel Implementation of Triangular Cellular Automata for Computing Two-Dimensional Elastodynamic Response on Arbitrary Domains

NASA Astrophysics Data System (ADS)

Leamy, Michael J.; Springer, Adam C.

In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straight-forward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to computer clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
Implementation of a Parallel Kalman Filter for Stratospheric Chemical Tracer Assimilation

NASA Technical Reports Server (NTRS)

Chang, Lang-Ping; Lyster, Peter M.; Menard, R.; Cohn, S. E.

1998-01-01

A Kalman filter for the assimilation of long-lived atmospheric chemical constituents has been developed for two-dimensional transport models on isentropic surfaces over the globe. An important attribute of the Kalman filter is that it calculates error covariances of the constituent fields using the tracer dynamics. Consequently, the current Kalman-filter assimilation is a five-dimensional problem (coordinates of two points and time), and it can only be handled on computers with large memory and high floating point speed. In this paper, an implementation of the Kalman filter for distributed-memory, message-passing parallel computers is discussed. Two approaches were studied: an operator decomposition and a covariance decomposition. The latter was found to be more scalable than the former, and it possesses the property that the dynamical model does not need to be parallelized, which is of considerable practical advantage. This code is currently used to assimilate constituent data retrieved by limb sounders on the Upper Atmosphere Research Satellite. Tests of the code examined the variance transport and observability properties. Aspects of the parallel implementation, some timing results, and a brief discussion of the physical results will be presented.
Flow of Emotional Messages in Artificial Social Networks

NASA Astrophysics Data System (ADS)

Chmiel, Anna; Hołyst, Janusz A.

Models of message flows in an artificial group of users communicating via the Internet are introduced and investigated using numerical simulations. We assumed that messages possess an emotional character with a positive valence and that the willingness to send the next affective message to a given person increases with the number of messages received from this person. As a result, the weights of links between group members evolve over time. Memory effects are introduced, taking into account that the preferential selection of message receivers depends on the communication intensity during the recent period only. We also model the phenomenon of secondary social sharing when the reception of an emotional e-mail triggers the distribution of several emotional e-mails to other people.
Message Passing and Shared Address Space Parallelism on an SMP Cluster

NASA Technical Reports Server (NTRS)

Shan, Hongzhang; Singh, Jaswinder P.; Oliker, Leonid; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

2002-01-01

Currently, message passing (MP) and shared address space (SAS) are the two leading parallel programming paradigms. MP has been standardized with MPI, and is the more common and mature approach; however, code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of and the programming effort required for six applications under both programming models on a 32-processor PC-SMP cluster, a platform that is becoming increasingly attractive for high-end scientific computing. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and/or complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications, while being competitive for the others. A hybrid MPI+SAS strategy shows only a small performance advantage over pure MPI in some cases. Finally, improved implementations of two MPI collective operations on PC-SMP clusters are presented.
Message passing for integrating and assessing renewable generation in a redundant power grid

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zdeborova, Lenka; Backhaus, Scott; Chertkov, Michael

2009-01-01

A simplified model of a redundant power grid is used to study integration of fluctuating renewable generation. The grid consists of large number of generator and consumer nodes. The net power consumption is determined by the difference between the gross consumption and the level of renewable generation. The gross consumption is drawn from a narrow distribution representing the predictability of aggregated loads, and we consider two different distributions representing wind and solar resources. Each generator is connected to D consumers, and redundancy is built in by connecting R {le} D of these consumers to other generators. The lines are switchablemore » so that at any instance each consumer is connected to a single generator. We explore the capacity of the renewable generation by determining the level of 'firm' generation capacity that can be displaced for different levels of redundancy R. We also develop message-passing control algorithm for finding switch sellings where no generator is overloaded.« less
Performance analysis of distributed applications using automatic classification of communication inefficiencies

DOEpatents

Vetter, Jeffrey S.

2005-02-01

The method and system described herein presents a technique for performance analysis that helps users understand the communication behavior of their message passing applications. The method and system described herein may automatically classifies individual communication operations and reveal the cause of communication inefficiencies in the application. This classification allows the developer to quickly focus on the culprits of truly inefficient behavior, rather than manually foraging through massive amounts of performance data. Specifically, the method and system described herein trace the message operations of Message Passing Interface (MPI) applications and then classify each individual communication event using a supervised learning technique: decision tree classification. The decision tree may be trained using microbenchmarks that demonstrate both efficient and inefficient communication. Since the method and system described herein adapt to the target system's configuration through these microbenchmarks, they simultaneously automate the performance analysis process and improve classification accuracy. The method and system described herein may improve the accuracy of performance analysis and dramatically reduce the amount of data that users must encounter.
Extended write combining using a write continuation hint flag

DOEpatents

Chen, Dong; Gara, Alan; Heidelberger, Philip; Ohmacht, Martin; Vranas, Pavlos

2013-06-04

A computing apparatus for reducing the amount of processing in a network computing system which includes a network system device of a receiving node for receiving electronic messages comprising data. The electronic messages are transmitted from a sending node. The network system device determines when more data of a specific electronic message is being transmitted. A memory device stores the electronic message data and communicating with the network system device. A memory subsystem communicates with the memory device. The memory subsystem stores a portion of the electronic message when more data of the specific message will be received, and the buffer combines the portion with later received data and moves the data to the memory device for accessible storage.
Managing internode data communications for an uninitialized process in a parallel computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Archer, Charles J; Blocksome, Michael A; Miller, Douglas R

2014-05-20

A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes, MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior tomore » initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.« less
Managing internode data communications for an uninitialized process in a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E

2014-05-20

A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes, MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.
Trellis Tone Modulation Multiple-Access for Peer Discovery in D2D Networks

PubMed Central

Lim, Chiwoo; Kim, Sang-Hyo

2018-01-01

In this paper, a new non-orthogonal multiple-access scheme, trellis tone modulation multiple-access (TTMMA), is proposed for peer discovery of distributed device-to-device (D2D) communication. The range and capacity of discovery are important performance metrics in peer discovery. The proposed trellis tone modulation uses single-tone transmission and achieves a long discovery range due to its low Peak-to-Average Power Ratio (PAPR). The TTMMA also exploits non-orthogonal resource assignment to increase the discovery capacity. For the multi-user detection of superposed multiple-access signals, a message-passing algorithm with supplementary schemes are proposed. With TTMMA and its message-passing demodulation, approximately 1.5 times the number of devices are discovered compared to the conventional frequency division multiple-access (FDMA)-based discovery. PMID:29673167
Trellis Tone Modulation Multiple-Access for Peer Discovery in D2D Networks.

PubMed

Lim, Chiwoo; Jang, Min; Kim, Sang-Hyo

2018-04-17

In this paper, a new non-orthogonal multiple-access scheme, trellis tone modulation multiple-access (TTMMA), is proposed for peer discovery of distributed device-to-device (D2D) communication. The range and capacity of discovery are important performance metrics in peer discovery. The proposed trellis tone modulation uses single-tone transmission and achieves a long discovery range due to its low Peak-to-Average Power Ratio (PAPR). The TTMMA also exploits non-orthogonal resource assignment to increase the discovery capacity. For the multi-user detection of superposed multiple-access signals, a message-passing algorithm with supplementary schemes are proposed. With TTMMA and its message-passing demodulation, approximately 1.5 times the number of devices are discovered compared to the conventional frequency division multiple-access (FDMA)-based discovery.
Preventing messaging queue deadlocks in a DMA environment

DOEpatents

Blocksome, Michael A; Chen, Dong; Gooding, Thomas; Heidelberger, Philip; Parker, Jeff

2014-01-14

Embodiments of the invention may be used to manage message queues in a parallel computing environment to prevent message queue deadlock. A direct memory access controller of a compute node may determine when a messaging queue is full. In response, the DMA may generate and interrupt. An interrupt handler may stop the DMA and swap all descriptors from the full messaging queue into a larger queue (or enlarge the original queue). The interrupt handler then restarts the DMA. Alternatively, the interrupt handler stops the DMA, allocates a memory block to hold queue data, and then moves descriptors from the full messaging queue into the allocated memory block. The interrupt handler then restarts the DMA. During a normal messaging advance cycle, a messaging manager attempts to inject the descriptors in the memory block into other messaging queues until the descriptors have all been processed.

Managing coherence via put/get windows

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2011-01-11

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2012-02-21

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Passing crisis and emergency risk communications: the effects of communication channel, information type, and repetition.

PubMed

Edworthy, Judy; Hellier, Elizabeth; Newbold, Lex; Titchener, Kirsteen

2015-05-01

Three experiments explore several factors which influence information transmission when warning messages are passed from person to person. In Experiment 1, messages were passed down chains of participants using five different modes of communication. Written communication channels resulted in more accurate message transmission than verbal. In addition, some elements of the message endured further down the chain than others. Experiment 2 largely replicated these effects and also demonstrated that simple repetition of a message eliminated differences between written and spoken communication. In a final field experiment, chains of participants passed information however they wanted to, with the proviso that half of the chains could not use telephones. Here, the lack of ability to use a telephone did not affect accuracy, but did slow down the speed of transmission from the recipient of the message to the last person in the chain. Implications of the findings for crisis and emergency risk communication are discussed. Copyright © 2015 Elsevier Ltd and The Ergonomics Society. All rights reserved.
Increasing available FIFO space to prevent messaging queue deadlocks in a DMA environment

DOEpatents

Blocksome, Michael A [Rochester, MN; Chen, Dong [Croton On Hudson, NY; Gooding, Thomas [Rochester, MN; Heidelberger, Philip [Cortlandt Manor, NY; Parker, Jeff [Rochester, MN

2012-02-07

Embodiments of the invention may be used to manage message queues in a parallel computing environment to prevent message queue deadlock. A direct memory access controller of a compute node may determine when a messaging queue is full. In response, the DMA may generate an interrupt. An interrupt handler may stop the DMA and swap all descriptors from the full messaging queue into a larger queue (or enlarge the original queue). The interrupt handler then restarts the DMA. Alternatively, the interrupt handler stops the DMA, allocates a memory block to hold queue data, and then moves descriptors from the full messaging queue into the allocated memory block. The interrupt handler then restarts the DMA. During a normal messaging advance cycle, a messaging manager attempts to inject the descriptors in the memory block into other messaging queues until the descriptors have all been processed.
The serial message-passing schedule for LDPC decoding algorithms

NASA Astrophysics Data System (ADS)

Liu, Mingshan; Liu, Shanshan; Zhou, Yuan; Jiang, Xue

2015-12-01

The conventional message-passing schedule for LDPC decoding algorithms is the so-called flooding schedule. It has the disadvantage that the updated messages cannot be used until next iteration, thus reducing the convergence speed . In this case, the Layered Decoding algorithm (LBP) based on serial message-passing schedule is proposed. In this paper the decoding principle of LBP algorithm is briefly introduced, and then proposed its two improved algorithms, the grouped serial decoding algorithm (Grouped LBP) and the semi-serial decoding algorithm .They can improve LBP algorithm's decoding speed while maintaining a good decoding performance.
Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Buntinas, D.; Mercier, G.; Gropp, W.

2005-12-02

This paper presents a new low-level communication subsystem called Nemesis. Nemesis has been designed and implemented to be scalable and efficient both in the intranode communication context using shared-memory and in the internode communication case using high-performance networks and is natively multimethod-enabled. Nemesis has been integrated in MPICH2 as a CH3 channel and delivers better performance than other dedicated communication channels in MPICH2. Furthermore, the resulting MPICH2 architecture outperforms other MPI implementations in point-to-point benchmarks.
Multiplexed memory-insensitive quantum repeaters.

PubMed

Collins, O A; Jenkins, S D; Kuzmich, A; Kennedy, T A B

2007-02-09

Long-distance quantum communication via distant pairs of entangled quantum bits (qubits) is the first step towards secure message transmission and distributed quantum computing. To date, the most promising proposals require quantum repeaters to mitigate the exponential decrease in communication rate due to optical fiber losses. However, these are exquisitely sensitive to the lifetimes of their memory elements. We propose a multiplexing of quantum nodes that should enable the construction of quantum networks that are largely insensitive to the coherence times of the quantum memory elements.
Tough2{_}MP: A parallel version of TOUGH2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Keni; Wu, Yu-Shu; Ding, Chris

2003-04-09

TOUGH2{_}MP is a massively parallel version of TOUGH2. It was developed for running on distributed-memory parallel computers to simulate large simulation problems that may not be solved by the standard, single-CPU TOUGH2 code. The new code implements an efficient massively parallel scheme, while preserving the full capacity and flexibility of the original TOUGH2 code. The new software uses the METIS software package for grid partitioning and AZTEC software package for linear-equation solving. The standard message-passing interface is adopted for communication among processors. Numerical performance of the current version code has been tested on CRAY-T3E and IBM RS/6000 SP platforms. Inmore » addition, the parallel code has been successfully applied to real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport. In this paper, we will review the development of the TOUGH2{_}MP, and discuss the basic features, modules, and their applications.« less
MPI, HPF or OpenMP: A Study with the NAS Benchmarks

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Frumkin, Michael; Hribar, Michelle; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)

1999-01-01

Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but the task can be simplified by high level languages and would even better be automated by parallelizing tools and compilers. The definition of HPF (High Performance Fortran, based on data parallel model) and OpenMP (based on shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to language like FORTRAN and simplify many tedious tasks encountered in writing message passing programs. In our study we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. Comparison of their performance with the MPI implementation and pros and cons of different approaches will be discussed along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study,potentials of applying some of the techniques to realistic aerospace applications will be presented
MPI, HPF or OpenMP: A Study with the NAS Benchmarks

NASA Technical Reports Server (NTRS)

Jin, H.; Frumkin, M.; Hribar, M.; Waheed, A.; Yan, J.; Saini, Subhash (Technical Monitor)

1999-01-01

Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but this task can be simplified by high level languages and would even better be automated by parallelizing tools and compilers. The definition of HPF (High Performance Fortran, based on data parallel model) and OpenMP (based on shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to language like FORTRAN and simplify many tedious tasks encountered in writing message passing programs. In our study, we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. Comparison of their performance with the MPI implementation and pros and cons of different approaches will be discussed along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, potentials of applying some of the techniques to realistic aerospace applications will be presented.
Extending the length and time scales of Gram-Schmidt Lyapunov vector computations

NASA Astrophysics Data System (ADS)

Costa, Anthony B.; Green, Jason R.

2013-08-01

Lyapunov vectors have found growing interest recently due to their ability to characterize systems out of thermodynamic equilibrium. The computation of orthogonal Gram-Schmidt vectors requires multiplication and QR decomposition of large matrices, which grow as N2 (with the particle count). This expense has limited such calculations to relatively small systems and short time scales. Here, we detail two implementations of an algorithm for computing Gram-Schmidt vectors. The first is a distributed-memory message-passing method using Scalapack. The second uses the newly-released MAGMA library for GPUs. We compare the performance of both codes for Lennard-Jones fluids from N=100 to 1300 between Intel Nahalem/Infiniband DDR and NVIDIA C2050 architectures. To our best knowledge, these are the largest systems for which the Gram-Schmidt Lyapunov vectors have been computed, and the first time their calculation has been GPU-accelerated. We conclude that Lyapunov vector calculations can be significantly extended in length and time by leveraging the power of GPU-accelerated linear algebra.
Parallel Fortran-MPI software for numerical inversion of the Laplace transform and its application to oscillatory water levels in groundwater environments

USGS Publications Warehouse

Zhan, X.

2005-01-01

A parallel Fortran-MPI (Message Passing Interface) software for numerical inversion of the Laplace transform based on a Fourier series method is developed to meet the need of solving intensive computational problems involving oscillatory water level's response to hydraulic tests in a groundwater environment. The software is a parallel version of ACM (The Association for Computing Machinery) Transactions on Mathematical Software (TOMS) Algorithm 796. Running 38 test examples indicated that implementation of MPI techniques with distributed memory architecture speedups the processing and improves the efficiency. Applications to oscillatory water levels in a well during aquifer tests are presented to illustrate how this package can be applied to solve complicated environmental problems involved in differential and integral equations. The package is free and is easy to use for people with little or no previous experience in using MPI but who wish to get off to a quick start in parallel computing. ?? 2004 Elsevier Ltd. All rights reserved.
Parallel implementation of a Lagrangian-based model on an adaptive mesh in C++: Application to sea-ice

NASA Astrophysics Data System (ADS)

Samaké, Abdoulaye; Rampal, Pierre; Bouillon, Sylvain; Ólason, Einar

2017-12-01

We present a parallel implementation framework for a new dynamic/thermodynamic sea-ice model, called neXtSIM, based on the Elasto-Brittle rheology and using an adaptive mesh. The spatial discretisation of the model is done using the finite-element method. The temporal discretisation is semi-implicit and the advection is achieved using either a pure Lagrangian scheme or an Arbitrary Lagrangian Eulerian scheme (ALE). The parallel implementation presented here focuses on the distributed-memory approach using the message-passing library MPI. The efficiency and the scalability of the parallel algorithms are illustrated by the numerical experiments performed using up to 500 processor cores of a cluster computing system. The performance obtained by the proposed parallel implementation of the neXtSIM code is shown being sufficient to perform simulations for state-of-the-art sea ice forecasting and geophysical process studies over geographical domain of several millions squared kilometers like the Arctic region.
Performance Comparison of a Matrix Solver on a Heterogeneous Network Using Two Implementations of MPI: MPICH and LAM

NASA Technical Reports Server (NTRS)

Phillips, Jennifer K.

1995-01-01

Two of the current and most popular implementations of the Message-Passing Standard, Message Passing Interface (MPI), were contrasted: MPICH by Argonne National Laboratory, and LAM by the Ohio Supercomputer Center at Ohio State University. A parallel skyline matrix solver was adapted to be run in a heterogeneous environment using MPI. The Message-Passing Interface Forum was held in May 1994 which lead to a specification of library functions that implement the message-passing model of parallel communication. LAM, which creates it's own environment, is more robust in a highly heterogeneous network. MPICH uses the environment native to the machine architecture. While neither of these free-ware implementations provides the performance of native message-passing or vendor's implementations, MPICH begins to approach that performance on the SP-2. The machines used in this study were: IBM RS6000, 3 Sun4, SGI, and the IBM SP-2. Each machine is unique and a few machines required specific modifications during the installation. When installed correctly, both implementations worked well with only minor problems.
Parallelization of the TRIGRS model for rainfall-induced landslides using the message passing interface

USGS Publications Warehouse

Alvioli, M.; Baum, R.L.

2016-01-01

We describe a parallel implementation of TRIGRS, the Transient Rainfall Infiltration and Grid-Based Regional Slope-Stability Model for the timing and distribution of rainfall-induced shallow landslides. We have parallelized the four time-demanding execution modes of TRIGRS, namely both the saturated and unsaturated model with finite and infinite soil depth options, within the Message Passing Interface framework. In addition to new features of the code, we outline details of the parallel implementation and show the performance gain with respect to the serial code. Results are obtained both on commercial hardware and on a high-performance multi-node machine, showing the different limits of applicability of the new code. We also discuss the implications for the application of the model on large-scale areas and as a tool for real-time landslide hazard monitoring.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Gorentla Venkata, Manjunath; Graham, Richard L; Ladd, Joshua S

This paper describes the design and implementation of InfiniBand (IB) CORE-Direct based blocking and nonblocking broadcast operations within the Cheetah collective operation framework. It describes a novel approach that fully ofFLoads collective operations and employs only user-supplied buffers. For a 64 rank communicator, the latency of CORE-Direct based hierarchical algorithm is better than production-grade Message Passing Interface (MPI) implementations, 150% better than the default Open MPI algorithm and 115% better than the shared memory optimized MVAPICH implementation for a one kilobyte (KB) message, and for eight mega-bytes (MB) it is 48% and 64% better, respectively. Flat-topology broadcast achieves 99.9% overlapmore » in a polling based communication-computation test, and 95.1% overlap for a wait based test, compared with 92.4% and 17.0%, respectively, for a similar Central Processing Unit (CPU) based implementation.« less
Saguaro: A Distributed Operating System Based on Pools of Servers.

DTIC Science & Technology

1988-03-25

asynchronous message passing, multicast, and semaphores are supported. We have found this flexibility to be very useful for distributed programming. The...variety of communication primitives provided by SR has facilitated the research of Stella Atkins, who was a visiting professor at Arizona during Spring...data bits in a raw communication channel to help keep the source and destination synchronized , Psync explicitly embeds timing information drawn from the
Xyce parallel electronic simulator users guide, version 6.1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas; Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers; A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models; Device models that are specifically tailored to meet Sandia's needs, including some radiationaware devices (for Sandia users only); and Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase-a message passing parallel implementation-which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users' guide, Version 6.0.1.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users guide, version 6.0.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less

Implementing Multidisciplinary and Multi-Zonal Applications Using MPI

NASA Technical Reports Server (NTRS)

Fineberg, Samuel A.

1995-01-01

Multidisciplinary and multi-zonal applications are an important class of applications in the area of Computational Aerosciences. In these codes, two or more distinct parallel programs or copies of a single program are utilized to model a single problem. To support such applications, it is common to use a programming model where a program is divided into several single program multiple data stream (SPMD) applications, each of which solves the equations for a single physical discipline or grid zone. These SPMD applications are then bound together to form a single multidisciplinary or multi-zonal program in which the constituent parts communicate via point-to-point message passing routines. Unfortunately, simple message passing models, like Intel's NX library, only allow point-to-point and global communication within a single system-defined partition. This makes implementation of these applications quite difficult, if not impossible. In this report it is shown that the new Message Passing Interface (MPI) standard is a viable portable library for implementing the message passing portion of multidisciplinary applications. Further, with the extension of a portable loader, fully portable multidisciplinary application programs can be developed. Finally, the performance of MPI is compared to that of some native message passing libraries. This comparison shows that MPI can be implemented to deliver performance commensurate with native message libraries.
Driver memory for in-vehicle visual and auditory messages

DOT National Transportation Integrated Search

1999-12-01

Three experiments were conducted in a driving simulator to evaluate effects of in-vehicle message modality and message format on comprehension and memory for younger and older drivers. Visual icons and text messages were effective in terms of high co...
How Communication Goals Determine when Audience Tuning Biases Memory

ERIC Educational Resources Information Center

Echterhoff, Gerald; Higgins, E. Tory; Kopietz, Rene; Groll, Stephan

2008-01-01

After tuning their message to suit their audience's attitude, communicators' own memories for the original information (e.g., a target person's behaviors) often reflect the biased view expressed in their message--producing an audience-congruent memory bias. Exploring the motivational circumstances of message production, the authors investigated…
Simplifying and speeding the management of intra-node cache coherence

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Phillip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2012-04-17

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blumrich, Matthias A; Chen, Dong; Coteus, Paul W

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an areamore » of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.« less
Efficient Implementation of Multigrid Solvers on Message-Passing Parrallel Systems

NASA Technical Reports Server (NTRS)

Lou, John

1994-01-01

We discuss our implementation strategies for finite difference multigrid partial differential equation (PDE) solvers on message-passing systems. Our target parallel architecture is Intel parallel computers: the Delta and Paragon system.
SAHAYOG: A Testbed for Load Sharing under Failure,

DTIC Science & Technology

1987-07-01

messages, shared memory and semaphores . To communicate using messages, processes create message queues using system-provided prim- itives. The message...The size of the memory that is to be shared is decided by the process when it makes a request for memory allocation. The semaphore option of IPC can be...used to prevent two or more concurrent processes from executing their critical sections at the same time. Semaphores must be used when the processes
Parallelization of a hydrological model using the message passing interface

USGS Publications Warehouse

Wu, Yiping; Li, Tiejian; Sun, Liqun; Chen, Ji

2013-01-01

With the increasing knowledge about the natural processes, hydrological models such as the Soil and Water Assessment Tool (SWAT) are becoming larger and more complex with increasing computation time. Additionally, other procedures such as model calibration, which may require thousands of model iterations, can increase running time and thus further reduce rapid modeling and analysis. Using the widely-applied SWAT as an example, this study demonstrates how to parallelize a serial hydrological model in a Windows® environment using a parallel programing technology—Message Passing Interface (MPI). With a case study, we derived the optimal values for the two parameters (the number of processes and the corresponding percentage of work to be distributed to the master process) of the parallel SWAT (P-SWAT) on an ordinary personal computer and a work station. Our study indicates that model execution time can be reduced by 42%–70% (or a speedup of 1.74–3.36) using multiple processes (two to five) with a proper task-distribution scheme (between the master and slave processes). Although the computation time cost becomes lower with an increasing number of processes (from two to five), this enhancement becomes less due to the accompanied increase in demand for message passing procedures between the master and all slave processes. Our case study demonstrates that the P-SWAT with a five-process run may reach the maximum speedup, and the performance can be quite stable (fairly independent of a project size). Overall, the P-SWAT can help reduce the computation time substantially for an individual model run, manual and automatic calibration procedures, and optimization of best management practices. In particular, the parallelization method we used and the scheme for deriving the optimal parameters in this study can be valuable and easily applied to other hydrological or environmental models.
Statistical analysis of loopy belief propagation in random fields

NASA Astrophysics Data System (ADS)

Yasuda, Muneki; Kataoka, Shun; Tanaka, Kazuyuki

2015-10-01

Loopy belief propagation (LBP), which is equivalent to the Bethe approximation in statistical mechanics, is a message-passing-type inference method that is widely used to analyze systems based on Markov random fields (MRFs). In this paper, we propose a message-passing-type method to analytically evaluate the quenched average of LBP in random fields by using the replica cluster variation method. The proposed analytical method is applicable to general pairwise MRFs with random fields whose distributions differ from each other and can give the quenched averages of the Bethe free energies over random fields, which are consistent with numerical results. The order of its computational cost is equivalent to that of standard LBP. In the latter part of this paper, we describe the application of the proposed method to Bayesian image restoration, in which we observed that our theoretical results are in good agreement with the numerical results for natural images.
Performance Analysis of Distributed Object-Oriented Applications

NASA Technical Reports Server (NTRS)

Schoeffler, James D.

1998-01-01

The purpose of this research was to evaluate the efficiency of a distributed simulation architecture which creates individual modules which are made self-scheduling through the use of a message-based communication system used for requesting input data from another module which is the source of that data. To make the architecture as general as possible, the message-based communication architecture was implemented using standard remote object architectures (Common Object Request Broker Architecture (CORBA) and/or Distributed Component Object Model (DCOM)). A series of experiments were run in which different systems are distributed in a variety of ways across multiple computers and the performance evaluated. The experiments were duplicated in each case so that the overhead due to message communication and data transmission can be separated from the time required to actually perform the computational update of a module each iteration. The software used to distribute the modules across multiple computers was developed in the first year of the current grant and was modified considerably to add a message-based communication scheme supported by the DCOM distributed object architecture. The resulting performance was analyzed using a model created during the first year of this grant which predicts the overhead due to CORBA and DCOM remote procedure calls and includes the effects of data passed to and from the remote objects. A report covering the distributed simulation software and the results of the performance experiments has been submitted separately. The above report also discusses possible future work to apply the methodology to dynamically distribute the simulation modules so as to minimize overall computation time.
An implementation and evaluation of the MPI 3.0 one-sided communication interface

DOE PAGES

Dinan, James S.; Balaji, Pavan; Buntinas, Darius T.; ...

2016-01-09

The Q1 Message Passing Interface (MPI) 3.0 standard includes a significant revision to MPI’s remote memory access (RMA) interface, which provides support for one-sided communication. MPI-3 RMA is expected to greatly enhance the usability and performance ofMPI RMA.We present the first complete implementation of MPI-3 RMA and document implementation techniques and performance optimization opportunities enabled by the new interface. Our implementation targets messaging-based networks and is publicly available in the latest release of the MPICH MPI implementation. Here using this implementation, we explore the performance impact of new MPI-3 functionality and semantics. Results indicate that the MPI-3 RMA interface providesmore » significant advantages over the MPI-2 interface by enabling increased communication concurrency through relaxed semantics in the interface and additional routines that provide new window types, synchronization modes, and atomic operations.« less
An implementation and evaluation of the MPI 3.0 one-sided communication interface

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dinan, James S.; Balaji, Pavan; Buntinas, Darius T.

The Q1 Message Passing Interface (MPI) 3.0 standard includes a significant revision to MPI’s remote memory access (RMA) interface, which provides support for one-sided communication. MPI-3 RMA is expected to greatly enhance the usability and performance ofMPI RMA.We present the first complete implementation of MPI-3 RMA and document implementation techniques and performance optimization opportunities enabled by the new interface. Our implementation targets messaging-based networks and is publicly available in the latest release of the MPICH MPI implementation. Here using this implementation, we explore the performance impact of new MPI-3 functionality and semantics. Results indicate that the MPI-3 RMA interface providesmore » significant advantages over the MPI-2 interface by enabling increased communication concurrency through relaxed semantics in the interface and additional routines that provide new window types, synchronization modes, and atomic operations.« less
Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer.

PubMed

Hines, Michael; Kumar, Sameer; Schürmann, Felix

2011-01-01

For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods using a Blue Gene/P (BG/P) supercomputer has been tested with 8-128 K cores using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·10(10) connections (K is 1024, M is 1024(2), and k is 1000). The spike exchange methods used are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method either implemented via non-blocking MPI_Isend, or exploiting the possibility of very low overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was that using MPI_Isend due to the high overhead of initiating a spike communication. The two best performing methods-the persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework DCMF_Multicast; and a two-phase multisend in which a DCMF_Multicast is used to first send to a subset of phase one destination cores, which then pass it on to their subset of phase two destination cores-had similar performance with very low overhead for the initiation of spike communication. Departure from ideal scaling for the Multisend methods is almost completely due to load imbalance caused by the large variation in number of cells that fire on each processor in the interval between synchronization. Spike exchange time itself is negligible since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will be ultimately limited by imbalance between incoming processor spikes between synchronization intervals. Thus, counterintuitively, maximization of load balance requires that the distribution of cells on processors should not reflect neural net architecture but be randomly distributed so that sets of cells which are burst firing together should be on different processors with their targets on as large a set of processors as possible.
A Down-to-Earth Educational Operating System for Up-in-the-Cloud Many-Core Architectures

ERIC Educational Resources Information Center

Ziwisky, Michael; Persohn, Kyle; Brylow, Dennis

2013-01-01

We present "Xipx," the first port of a major educational operating system to a processor in the emerging class of many-core architectures. Through extensions to the proven Embedded Xinu operating system, Xipx gives students hands-on experience with system programming in a distributed message-passing environment. We expose the software primitives…
Message passing with parallel queue traversal

DOEpatents

Underwood, Keith D [Albuquerque, NM; Brightwell, Ronald B [Albuquerque, NM; Hemmert, K Scott [Albuquerque, NM

2012-05-01

In message passing implementations, associative matching structures are used to permit list entries to be searched in parallel fashion, thereby avoiding the delay of linear list traversal. List management capabilities are provided to support list entry turnover semantics and priority ordering semantics.
Driver memory retention of in-vehicle information system messages

DOT National Transportation Integrated Search

1997-01-01

Memory retention of drivers was tested for traffic- and traveler-related messages displayed on an in-vehicle information system (IVIS). Three research questions were asked: (a) How does in-vehicle visual message format affect comprehension? (b) How d...
Distributed Systems Technology Survey.

DTIC Science & Technology

1987-03-01

and prolocols. 2. Hardware Technology Ecnomic factor we a majo reonm for the prolierat of dlstbted systoe. Processors, memory, an magne tc ndoptical...destined messages and pertorn the a pro te forwarding. There gImsno agreement that a ightweight process mechanism is essential to support com- monly used...Xerox PARC environment [311. Shared file servers, discussed below, are essential to the success of such a scheme. 11. ecurlity A distributed
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Biswas, Rupak

1999-01-01

The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Software architecture for time-constrained machine vision applications

NASA Astrophysics Data System (ADS)

Usamentiaga, Rubén; Molleda, Julio; García, Daniel F.; Bulnes, Francisco G.

2013-01-01

Real-time image and video processing applications require skilled architects, and recent trends in the hardware platform make the design and implementation of these applications increasingly complex. Many frameworks and libraries have been proposed or commercialized to simplify the design and tuning of real-time image processing applications. However, they tend to lack flexibility, because they are normally oriented toward particular types of applications, or they impose specific data processing models such as the pipeline. Other issues include large memory footprints, difficulty for reuse, and inefficient execution on multicore processors. We present a novel software architecture for time-constrained machine vision applications that addresses these issues. The architecture is divided into three layers. The platform abstraction layer provides a high-level application programming interface for the rest of the architecture. The messaging layer provides a message-passing interface based on a dynamic publish/subscribe pattern. A topic-based filtering in which messages are published to topics is used to route the messages from the publishers to the subscribers interested in a particular type of message. The application layer provides a repository for reusable application modules designed for machine vision applications. These modules, which include acquisition, visualization, communication, user interface, and data processing, take advantage of the power of well-known libraries such as OpenCV, Intel IPP, or CUDA. Finally, the proposed architecture is applied to a real machine vision application: a jam detector for steel pickling lines.
Multiple-Ring Digital Communication Network

NASA Technical Reports Server (NTRS)

Kirkham, Harold

1992-01-01

Optical-fiber digital communication network to support data-acquisition and control functions of electric-power-distribution networks. Optical-fiber links of communication network follow power-distribution routes. Since fiber crosses open power switches, communication network includes multiple interconnected loops with occasional spurs. At each intersection node is needed. Nodes of communication network include power-distribution substations and power-controlling units. In addition to serving data acquisition and control functions, each node acts as repeater, passing on messages to next node(s). Multiple-ring communication network operates on new AbNET protocol and features fiber-optic communication.

Neighbourhood-consensus message passing and its potentials in image processing applications

NASA Astrophysics Data System (ADS)

Ružic, Tijana; Pižurica, Aleksandra; Philips, Wilfried

2011-03-01

In this paper, a novel algorithm for inference in Markov Random Fields (MRFs) is presented. Its goal is to find approximate maximum a posteriori estimates in a simple manner by combining neighbourhood influence of iterated conditional modes (ICM) and message passing of loopy belief propagation (LBP). We call the proposed method neighbourhood-consensus message passing because a single joint message is sent from the specified neighbourhood to the central node. The message, as a function of beliefs, represents the agreement of all nodes within the neighbourhood regarding the labels of the central node. This way we are able to overcome the disadvantages of reference algorithms, ICM and LBP. On one hand, more information is propagated in comparison with ICM, while on the other hand, the huge amount of pairwise interactions is avoided in comparison with LBP by working with neighbourhoods. The idea is related to the previously developed iterated conditional expectations algorithm. Here we revisit it and redefine it in a message passing framework in a more general form. The results on three different benchmarks demonstrate that the proposed technique can perform well both for binary and multi-label MRFs without any limitations on the model definition. Furthermore, it manifests improved performance over related techniques either in terms of quality and/or speed.
Foundations for a healthy future.

PubMed

Riley, R; Green, J; Willis, S; Soden, E; Rushby, C; Postle, D; Wakeling, S

1998-01-01

Health promotion activities with children and young people are important as they take messages about health seriously and can be influential in spreading messages about healthy living to their friends and families. Child health professionals have an important role to play in passing on messages of positive health to children and young people. Peer education is a useful way of passing on messages about health to young people. This article shares examples of three health promotion projects with children in a community trust, looking at asthma, sex education and testicular examination.
Slices: A Scalable Partitioner for Finite Element Meshes

NASA Technical Reports Server (NTRS)

Ding, H. Q.; Ferraro, R. D.

1995-01-01

A parallel partitioner for partitioning unstructured finite element meshes on distributed memory architectures is developed. The element based partitioner can handle mixtures of different element types. All algorithms adopted in the partitioner are scalable, including a communication template for unpredictable incoming messages, as shown in actual timing measurements.
Commanding and Controlling Satellite Clusters (IEEE Intelligent Systems, November/December 2000)

DTIC Science & Technology

2000-01-01

real - time operating system , a message-passing OS well suited for distributed...ground Flight processors ObjectAgent RTOS SCL RTOS RDMS Space command language Real - time operating system Rational database management system TS-21 RDMS...engineer with Princeton Satellite Systems. She is working with others to develop ObjectAgent software to run on the OSE Real Time Operating System .
A no-key-exchange secure image sharing scheme based on Shamir's three-pass cryptography protocol and the multiple-parameter fractional Fourier transform.

PubMed

Lang, Jun

2012-01-30

In this paper, we propose a novel secure image sharing scheme based on Shamir's three-pass protocol and the multiple-parameter fractional Fourier transform (MPFRFT), which can safely exchange information with no advance distribution of either secret keys or public keys between users. The image is encrypted directly by the MPFRFT spectrum without the use of phase keys, and information can be shared by transmitting the encrypted image (or message) three times between users. Numerical simulation results are given to verify the performance of the proposed algorithm.
Impact of the implementation of MPI point-to-point communications on the performance of two general sparse solvers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Amestoy, Patrick R.; Duff, Iain S.; L'Excellent, Jean-Yves

2001-10-10

We examine the mechanics of the send and receive mechanism of MPI and in particular how we can implement message passing in a robust way so that our performance is not significantly affected by changes to the MPI system. This leads us to using the Isend/Irecv protocol which will entail sometimes significant algorithmic changes. We discuss this within the context of two different algorithms for sparse Gaussian elimination that we have parallelized. One is a multifrontal solver called MUMPS, the other is a supernodal solver called SuperLU. Both algorithms are difficult to parallelize on distributed memory machines. Our initial strategiesmore » were based on simple MPI point-to-point communication primitives. With such approaches, the parallel performance of both codes are very sensitive to the MPI implementation, the way MPI internal buffers are used in particular. We then modified our codes to use more sophisticated nonblocking versions of MPI communication. This significantly improved the performance robustness (independent of the MPI buffering mechanism) and scalability, but at the cost of increased code complexity.« less
Parallel Monte Carlo transport modeling in the context of a time-dependent, three-dimensional multi-physics code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Procassini, R.J.

1997-12-31

The fine-scale, multi-space resolution that is envisioned for accurate simulations of complex weapons systems in three spatial dimensions implies flop-rate and memory-storage requirements that will only be obtained in the near future through the use of parallel computational techniques. Since the Monte Carlo transport models in these simulations usually stress both of these computational resources, they are prime candidates for parallelization. The MONACO Monte Carlo transport package, which is currently under development at LLNL, will utilize two types of parallelism within the context of a multi-physics design code: decomposition of the spatial domain across processors (spatial parallelism) and distribution ofmore » particles in a given spatial subdomain across additional processors (particle parallelism). This implementation of the package will utilize explicit data communication between domains (message passing). Such a parallel implementation of a Monte Carlo transport model will result in non-deterministic communication patterns. The communication of particles between subdomains during a Monte Carlo time step may require a significant level of effort to achieve a high parallel efficiency.« less
Extending the length and time scales of Gram–Schmidt Lyapunov vector computations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Costa, Anthony B., E-mail: acosta@northwestern.edu; Green, Jason R., E-mail: jason.green@umb.edu; Department of Chemistry, University of Massachusetts Boston, Boston, MA 02125

Lyapunov vectors have found growing interest recently due to their ability to characterize systems out of thermodynamic equilibrium. The computation of orthogonal Gram–Schmidt vectors requires multiplication and QR decomposition of large matrices, which grow as N{sup 2} (with the particle count). This expense has limited such calculations to relatively small systems and short time scales. Here, we detail two implementations of an algorithm for computing Gram–Schmidt vectors. The first is a distributed-memory message-passing method using Scalapack. The second uses the newly-released MAGMA library for GPUs. We compare the performance of both codes for Lennard–Jones fluids from N=100 to 1300 betweenmore » Intel Nahalem/Infiniband DDR and NVIDIA C2050 architectures. To our best knowledge, these are the largest systems for which the Gram–Schmidt Lyapunov vectors have been computed, and the first time their calculation has been GPU-accelerated. We conclude that Lyapunov vector calculations can be significantly extended in length and time by leveraging the power of GPU-accelerated linear algebra.« less
A hybrid parallel framework for the cellular Potts model simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jiang, Yi; He, Kejing; Dong, Shoubin

2009-01-01

The Cellular Potts Model (CPM) has been widely used for biological simulations. However, most current implementations are either sequential or approximated, which can't be used for large scale complex 3D simulation. In this paper we present a hybrid parallel framework for CPM simulations. The time-consuming POE solving, cell division, and cell reaction operation are distributed to clusters using the Message Passing Interface (MPI). The Monte Carlo lattice update is parallelized on shared-memory SMP system using OpenMP. Because the Monte Carlo lattice update is much faster than the POE solving and SMP systems are more and more common, this hybrid approachmore » achieves good performance and high accuracy at the same time. Based on the parallel Cellular Potts Model, we studied the avascular tumor growth using a multiscale model. The application and performance analysis show that the hybrid parallel framework is quite efficient. The hybrid parallel CPM can be used for the large scale simulation ({approx}10{sup 8} sites) of complex collective behavior of numerous cells ({approx}10{sup 6}).« less
Hybrid MPI/OpenMP Implementation of the ORAC Molecular Dynamics Program for Generalized Ensemble and Fast Switching Alchemical Simulations.

PubMed

Procacci, Piero

2016-06-27

We present a new release (6.0β) of the ORAC program [Marsili et al. J. Comput. Chem. 2010, 31, 1106-1116] with a hybrid OpenMP/MPI (open multiprocessing message passing interface) multilevel parallelism tailored for generalized ensemble (GE) and fast switching double annihilation (FS-DAM) nonequilibrium technology aimed at evaluating the binding free energy in drug-receptor system on high performance computing platforms. The production of the GE or FS-DAM trajectories is handled using a weak scaling parallel approach on the MPI level only, while a strong scaling force decomposition scheme is implemented for intranode computations with shared memory access at the OpenMP level. The efficiency, simplicity, and inherent parallel nature of the ORAC implementation of the FS-DAM algorithm, project the code as a possible effective tool for a second generation high throughput virtual screening in drug discovery and design. The code, along with documentation, testing, and ancillary tools, is distributed under the provisions of the General Public License and can be freely downloaded at www.chim.unifi.it/orac .
Parallelization of TWOPORFLOW, a Cartesian Grid based Two-phase Porous Media Code for Transient Thermo-hydraulic Simulations

NASA Astrophysics Data System (ADS)

Trost, Nico; Jiménez, Javier; Imke, Uwe; Sanchez, Victor

2014-06-01

TWOPORFLOW is a thermo-hydraulic code based on a porous media approach to simulate single- and two-phase flow including boiling. It is under development at the Institute for Neutron Physics and Reactor Technology (INR) at KIT. The code features a 3D transient solution of the mass, momentum and energy conservation equations for two inter-penetrating fluids with a semi-implicit continuous Eulerian type solver. The application domain of TWOPORFLOW includes the flow in standard porous media and in structured porous media such as micro-channels and cores of nuclear power plants. In the latter case, the fluid domain is coupled to a fuel rod model, describing the heat flow inside the solid structure. In this work, detailed profiling tools have been utilized to determine the optimization potential of TWOPORFLOW. As a result, bottle-necks were identified and reduced in the most feasible way, leading for instance to an optimization of the water-steam property computation. Furthermore, an OpenMP implementation addressing the routines in charge of inter-phase momentum-, energy- and mass-coupling delivered good performance together with a high scalability on shared memory architectures. In contrast to that, the approach for distributed memory systems was to solve sub-problems resulting by the decomposition of the initial Cartesian geometry. Thread communication for the sub-problem boundary updates was accomplished by the Message Passing Interface (MPI) standard.
Parallelization of a Monte Carlo particle transport simulation code

NASA Astrophysics Data System (ADS)

Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

2010-05-01

We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
Xyce Parallel Electronic Simulator Users' Guide Version 6.8

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase$-$ a message passing parallel implementation $-$ which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Cluster Computing For Real Time Seismic Array Analysis.

NASA Astrophysics Data System (ADS)

Martini, M.; Giudicepietro, F.

A seismic array is an instrument composed by a dense distribution of seismic sen- sors that allow to measure the directional properties of the wavefield (slowness or wavenumber vector) radiated by a seismic source. Over the last years arrays have been widely used in different fields of seismological researches. In particular they are applied in the investigation of seismic sources on volcanoes where they can be suc- cessfully used for studying the volcanic microtremor and long period events which are critical for getting information on the volcanic systems evolution. For this reason arrays could be usefully employed for the volcanoes monitoring, however the huge amount of data produced by this type of instruments and the processing techniques which are quite time consuming limited their potentiality for this application. In order to favor a direct application of arrays techniques to continuous volcano monitoring we designed and built a small PC cluster able to near real time computing the kinematics properties of the wavefield (slowness or wavenumber vector) produced by local seis- mic source. The cluster is composed of 8 Intel Pentium-III bi-processors PC working at 550 MHz, and has 4 Gigabytes of RAM memory. It runs under Linux operating system. The developed analysis software package is based on the Multiple SIgnal Classification (MUSIC) algorithm and is written in Fortran. The message-passing part is based upon the LAM programming environment package, an open-source imple- mentation of the Message Passing Interface (MPI). The developed software system includes modules devote to receiving date by internet and graphical applications for the continuous displaying of the processing results. The system has been tested with a data set collected during a seismic experiment conducted on Etna in 1999 when two dense seismic arrays have been deployed on the northeast and the southeast flanks of this volcano. A real time continuous acquisition system has been simulated by a pro- gram which reads data from disk files and send them to a remote host by using the Internet protocols.
The impact of cultural exposure and message framing on oral health behavior: Exploring the role of message memory

PubMed Central

Brick, Cameron; McCully, Scout N.; Updegraff, John A.; Ehret, Phillip J.; Areguin, Maira A.; Sherman, David K.

2015-01-01

Background Health messages are more effective when framed to be congruent with recipient characteristics, and health practitioners can strategically decide on message features to promote adherence to recommended behaviors. We present exposure to United States (U.S.) culture as a moderator of the impact of gain- vs. loss-frame messages. Since U.S. culture emphasizes individualism and approach orientation, greater cultural exposure was expected to predict improved patient choices and memory for gain-framed messages, whereas individuals with less exposure to U.S. culture would show these advantages for loss-framed messages. Methods 223 participants viewed a written oral health message in one of three randomized conditions: gain-frame, loss-frame, or no-message control, and were given ten flosses. Cultural exposure was measured with the proportions of life spent and parents born in the U.S. At baseline and one week later, participants completed recall tests and reported recent flossing behavior. Results Message frame and cultural exposure interacted to predict improved patient decisions (increased flossing) and memory maintenance for the health message over one week. E.g., those with low cultural exposure who saw a loss-frame message flossed more. Incongruent messages led to the same flossing rates as no message. Memory retention did not explain the effect of message congruency on flossing. Limitations Flossing behavior was self-reported. Cultural exposure may only have practical application in either highly individualistic or collectivistic countries. Conclusions In healthcare settings where patients are urged to follow a behavior, asking basic demographic questions could allow medical practitioners to intentionally communicate in terms of gains or losses to improve patient decision making and treatment adherence. PMID:25654986
Impact of Cultural Exposure and Message Framing on Oral Health Behavior: Exploring the Role of Message Memory.

PubMed

Brick, Cameron; McCully, Scout N; Updegraff, John A; Ehret, Phillip J; Areguin, Maira A; Sherman, David K

2016-10-01

Health messages are more effective when framed to be congruent with recipient characteristics, and health practitioners can strategically choose message features to promote adherence to recommended behaviors. We present exposure to US culture as a moderator of the impact of gain-frame versus loss-frame messages. Since US culture emphasizes individualism and approach orientation, greater cultural exposure was expected to predict improved patient choices and memory for gain-framed messages, whereas individuals with less exposure to US culture would show these advantages for loss-framed messages. 223 participants viewed a written oral health message in 1 of 3 randomized conditions-gain-frame, loss-frame, or no-message control-and were given 10 flosses. Cultural exposure was measured with the proportions of life spent and parents born in the US. At baseline and 1 week later, participants completed recall tests and reported recent flossing behavior. Message frame and cultural exposure interacted to predict improved patient decisions (increased flossing) and memory maintenance for the health message over 1 week; for example, those with low cultural exposure who saw a loss-frame message flossed more. Incongruent messages led to the same flossing rates as no message. Memory retention did not explain the effect of message congruency on flossing. Flossing behavior was self-reported. Cultural exposure may only have practical application in either highly individualistic or collectivistic countries. In health care settings where patients are urged to follow a behavior, asking basic demographic questions could allow medical practitioners to intentionally communicate in terms of gains or losses to improve patient decision making and treatment adherence. © The Author(s) 2015.
Dispatching packets on a global combining network of a parallel computer

DOEpatents

Almasi, Gheorghe [Ardsley, NY; Archer, Charles J [Rochester, MN

2011-07-19

Methods, apparatus, and products are disclosed for dispatching packets on a global combining network of a parallel computer comprising a plurality of nodes connected for data communications using the network capable of performing collective operations and point to point operations that include: receiving, by an origin system messaging module on an origin node from an origin application messaging module on the origin node, a storage identifier and an operation identifier, the storage identifier specifying storage containing an application message for transmission to a target node, and the operation identifier specifying a message passing operation; packetizing, by the origin system messaging module, the application message into network packets for transmission to the target node, each network packet specifying the operation identifier and an operation type for the message passing operation specified by the operation identifier; and transmitting, by the origin system messaging module, the network packets to the target node.
MPIRUN: A Portable Loader for Multidisciplinary and Multi-Zonal Applications

NASA Technical Reports Server (NTRS)

Fineberg, Samuel A.; Woodrow, Thomas S. (Technical Monitor)

1994-01-01

Multidisciplinary and multi-zonal applications are an important class of applications in the area of Computational Aerosciences. In these codes, two or more distinct parallel programs or copies of a single program are utilized to model a single problem. To support such applications, it is common to use a programming model where a program is divided into several single program multiple data stream (SPMD) applications, each of which solves the equations for a single physical discipline or grid zone. These SPMD applications are then bound together to form a single multidisciplinary or multi-zonal program in which the constituent parts communicate via point-to-point message passing routines. One method for implementing the message passing portion of these codes is with the new Message Passing Interface (MPI) standard. Unfortunately, this standard only specifies the message passing portion of an application, but does not specify any portable mechanisms for loading an application. MPIRUN was developed to provide a portable means for loading MPI programs, and was specifically targeted at multidisciplinary and multi-zonal applications. Programs using MPIRUN for loading and MPI for message passing are then portable between all machines supported by MPIRUN. MPIRUN is currently implemented for the Intel iPSC/860, TMC CM5, IBM SP-1 and SP-2, Intel Paragon, and workstation clusters. Further, MPIRUN is designed to be simple enough to port easily to any system supporting MPI.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Painter, J.; McCormick, P.; Krogh, M.

This paper presents the ACL (Advanced Computing Lab) Message Passing Library. It is a high throughput, low latency communications library, based on Thinking Machines Corp.`s CMMD, upon which message passing applications can be built. The library has been implemented on the Cray T3D, Thinking Machines CM-5, SGI workstations, and on top of PVM.
A pervasive parallel framework for visualization: final report for FWP 10-014707

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth D.

2014-01-01

We are on the threshold of a transformative change in the basic architecture of highperformance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much more fine thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documentsmore » the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.« less

Mapping a battlefield simulation onto message-passing parallel architectures

NASA Technical Reports Server (NTRS)

Nicol, David M.

1987-01-01

Perhaps the most critical problem in distributed simulation is that of mapping: without an effective mapping of workload to processors the speedup potential of parallel processing cannot be realized. Mapping a simulation onto a message-passing architecture is especially difficult when the computational workload dynamically changes as a function of time and space; this is exactly the situation faced by battlefield simulations. This paper studies an approach where the simulated battlefield domain is first partitioned into many regions of equal size; typically there are more regions than processors. The regions are then assigned to processors; a processor is responsible for performing all simulation activity associated with the regions. The assignment algorithm is quite simple and attempts to balance load by exploiting locality of workload intensity. The performance of this technique is studied on a simple battlefield simulation implemented on the Flex/32 multiprocessor. Measurements show that the proposed method achieves reasonable processor efficiencies. Furthermore, the method shows promise for use in dynamic remapping of the simulation.
Statistical Inference in Graphical Models

DTIC Science & Technology

2008-06-17

fuse probability theory and graph theory in such a way as to permit efficient rep- resentation and computation with probability distributions. They...message passing. 59 viii 1. INTRODUCTION In approaching real-world problems, we often need to deal with uncertainty. Probability and statis- tics provide a...dynamic programming methods. However, for many sensors of interest, the signal-to-noise ratio does not allow such a treatment. Another source of
A flexible software architecture for scalable real-time image and video processing applications

NASA Astrophysics Data System (ADS)

Usamentiaga, Rubén; Molleda, Julio; García, Daniel F.; Bulnes, Francisco G.

2012-06-01

Real-time image and video processing applications require skilled architects, and recent trends in the hardware platform make the design and implementation of these applications increasingly complex. Many frameworks and libraries have been proposed or commercialized to simplify the design and tuning of real-time image processing applications. However, they tend to lack flexibility because they are normally oriented towards particular types of applications, or they impose specific data processing models such as the pipeline. Other issues include large memory footprints, difficulty for reuse and inefficient execution on multicore processors. This paper presents a novel software architecture for real-time image and video processing applications which addresses these issues. The architecture is divided into three layers: the platform abstraction layer, the messaging layer, and the application layer. The platform abstraction layer provides a high level application programming interface for the rest of the architecture. The messaging layer provides a message passing interface based on a dynamic publish/subscribe pattern. A topic-based filtering in which messages are published to topics is used to route the messages from the publishers to the subscribers interested in a particular type of messages. The application layer provides a repository for reusable application modules designed for real-time image and video processing applications. These modules, which include acquisition, visualization, communication, user interface and data processing modules, take advantage of the power of other well-known libraries such as OpenCV, Intel IPP, or CUDA. Finally, we present different prototypes and applications to show the possibilities of the proposed architecture.
Comparing the Performance of Two Dynamic Load Distribution Methods

NASA Technical Reports Server (NTRS)

Kale, L. V.

1987-01-01

Parallel processing of symbolic computations on a message-passing multi-processor presents one challenge: To effectively utilize the available processors, the load must be distributed uniformly to all the processors. However, the structure of these computations cannot be predicted in advance. go, static scheduling methods are not applicable. In this paper, we compare the performance of two dynamic, distributed load balancing methods with extensive simulation studies. The two schemes are: the Contracting Within a Neighborhood (CWN) scheme proposed by us, and the Gradient Model proposed by Lin and Keller. We conclude that although simpler, the CWN is significantly more effective at distributing the work than the Gradient model.
Heartbeat-based error diagnosis framework for distributed embedded systems

NASA Astrophysics Data System (ADS)

Mishra, Swagat; Khilar, Pabitra Mohan

2012-01-01

Distributed Embedded Systems have significant applications in automobile industry as steer-by-wire, fly-by-wire and brake-by-wire systems. In this paper, we provide a general framework for fault detection in a distributed embedded real time system. We use heartbeat monitoring, check pointing and model based redundancy to design a scalable framework that takes care of task scheduling, temperature control and diagnosis of faulty nodes in a distributed embedded system. This helps in diagnosis and shutting down of faulty actuators before the system becomes unsafe. The framework is designed and tested using a new simulation model consisting of virtual nodes working on a message passing system.
Heartbeat-based error diagnosis framework for distributed embedded systems

NASA Astrophysics Data System (ADS)

Mishra, Swagat; Khilar, Pabitra Mohan

2011-12-01

Distributed Embedded Systems have significant applications in automobile industry as steer-by-wire, fly-by-wire and brake-by-wire systems. In this paper, we provide a general framework for fault detection in a distributed embedded real time system. We use heartbeat monitoring, check pointing and model based redundancy to design a scalable framework that takes care of task scheduling, temperature control and diagnosis of faulty nodes in a distributed embedded system. This helps in diagnosis and shutting down of faulty actuators before the system becomes unsafe. The framework is designed and tested using a new simulation model consisting of virtual nodes working on a message passing system.
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hoemmen, Mark

2010-11-01

Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches formore » orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Demeure, I.M.

The research presented here is concerned with representation techniques and tools to support the design, prototyping, simulation, and evaluation of message-based parallel, distributed computations. The author describes ParaDiGM-Parallel, Distributed computation Graph Model-a visual representation technique for parallel, message-based distributed computations. ParaDiGM provides several views of a computation depending on the aspect of concern. It is made of two complementary submodels, the DCPG-Distributed Computing Precedence Graph-model, and the PAM-Process Architecture Model-model. DCPGs are precedence graphs used to express the functionality of a computation in terms of tasks, message-passing, and data. PAM graphs are used to represent the partitioning of a computationmore » into schedulable units or processes, and the pattern of communication among those units. There is a natural mapping between the two models. He illustrates the utility of ParaDiGM as a representation technique by applying it to various computations (e.g., an adaptive global optimization algorithm, the client-server model). ParaDiGM representations are concise. They can be used in documenting the design and the implementation of parallel, distributed computations, in describing such computations to colleagues, and in comparing and contrasting various implementations of the same computation. He then describes VISA-VISual Assistant, a software tool to support the design, prototyping, and simulation of message-based parallel, distributed computations. VISA is based on the ParaDiGM model. In particular, it supports the editing of ParaDiGM graphs to describe the computations of interest, and the animation of these graphs to provide visual feedback during simulations. The graphs are supplemented with various attributes, simulation parameters, and interpretations which are procedures that can be executed by VISA.« less
Improving security of the ping-pong protocol

NASA Astrophysics Data System (ADS)

Zawadzki, Piotr

2013-01-01

A security layer for the asymptotically secure ping-pong protocol is proposed and analyzed in the paper. The operation of the improvement exploits inevitable errors introduced by the eavesdropping in the control and message modes. Its role is similar to the privacy amplification algorithms known from the quantum key distribution schemes. Messages are processed in blocks which guarantees that an eavesdropper is faced with a computationally infeasible problem as long as the system parameters are within reasonable limits. The introduced additional information preprocessing does not require quantum memory registers and confidential communication is possible without prior key agreement or some shared secret.
Class network routing

DOEpatents

Bhanot, Gyan [Princeton, NJ; Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

2009-09-08

Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.
Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi

Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensionalmore » gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.« less
Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

DOE PAGES

Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; ...

2016-06-01

Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensionalmore » gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.« less
Advanced Networking and Distributed Systems Defense Advanced Research Projects Agency

DTIC Science & Technology

1992-06-01

balanced. The averagenumber of raviving nodes in each subcube is 2I (1 - *A node cannot transmit a message to a faulty p),. Each will send a units of...space at the processing elements. After being tal traffic will go to memory module 0 in a 1024-node generated, a packet is discarded if it cannot be
ms2: A molecular simulation tool for thermodynamic properties

NASA Astrophysics Data System (ADS)

Deublein, Stephan; Eckl, Bernhard; Stoll, Jürgen; Lishchuk, Sergey V.; Guevara-Carrion, Gabriela; Glass, Colin W.; Merker, Thorsten; Bernreuther, Martin; Hasse, Hans; Vrabec, Jadran

2011-11-01

This work presents the molecular simulation program ms2 that is designed for the calculation of thermodynamic properties of bulk fluids in equilibrium consisting of small electro-neutral molecules. ms2 features the two main molecular simulation techniques, molecular dynamics (MD) and Monte-Carlo. It supports the calculation of vapor-liquid equilibria of pure fluids and multi-component mixtures described by rigid molecular models on the basis of the grand equilibrium method. Furthermore, it is capable of sampling various classical ensembles and yields numerous thermodynamic properties. To evaluate the chemical potential, Widom's test molecule method and gradual insertion are implemented. Transport properties are determined by equilibrium MD simulations following the Green-Kubo formalism. ms2 is designed to meet the requirements of academia and industry, particularly achieving short response times and straightforward handling. It is written in Fortran90 and optimized for a fast execution on a broad range of computer architectures, spanning from single processor PCs over PC-clusters and vector computers to high-end parallel machines. The standard Message Passing Interface (MPI) is used for parallelization and ms2 is therefore easily portable to different computing platforms. Feature tools facilitate the interaction with the code and the interpretation of input and output files. The accuracy and reliability of ms2 has been shown for a large variety of fluids in preceding work. Program summaryProgram title:ms2 Catalogue identifier: AEJF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEJF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Special Licence supplied by the authors No. of lines in distributed program, including test data, etc.: 82 794 No. of bytes in distributed program, including test data, etc.: 793 705 Distribution format: tar.gz Programming language: Fortran90 Computer: The simulation tool ms2 is usable on a wide variety of platforms, from single processor machines over PC-clusters and vector computers to vector-parallel architectures. (Tested with Fortran compilers: gfortran, Intel, PathScale, Portland Group and Sun Studio.) Operating system: Unix/Linux, Windows Has the code been vectorized or parallelized?: Yes. Message Passing Interface (MPI) protocol Scalability. Excellent scalability up to 16 processors for molecular dynamics and >512 processors for Monte-Carlo simulations. RAM:ms2 runs on single processors with 512 MB RAM. The memory demand rises with increasing number of processors used per node and increasing number of molecules. Classification: 7.7, 7.9, 12 External routines: Message Passing Interface (MPI) Nature of problem: Calculation of application oriented thermodynamic properties for rigid electro-neutral molecules: vapor-liquid equilibria, thermal and caloric data as well as transport properties of pure fluids and multi-component mixtures. Solution method: Molecular dynamics, Monte-Carlo, various classical ensembles, grand equilibrium method, Green-Kubo formalism. Restrictions: No. The system size is user-defined. Typical problems addressed by ms2 can be solved by simulating systems containing typically 2000 molecules or less. Unusual features: Feature tools are available for creating input files, analyzing simulation results and visualizing molecular trajectories. Additional comments: Sample makefiles for multiple operation platforms are provided. Documentation is provided with the installation package and is available at http://www.ms-2.de. Running time: The running time of ms2 depends on the problem set, the system size and the number of processes used in the simulation. Running four processes on a "Nehalem" processor, simulations calculating VLE data take between two and twelve hours, calculating transport properties between six and 24 hours.
Comparison of two paradigms for distributed shared memory

DOE Office of Scientific and Technical Information (OSTI.GOV)

Levelt, W.G.; Kaashoek, M.F.; Bal, H.E.

1990-08-01

The paper compares two paradigms for Distributed Shared Memory on loosely coupled computing systems: the shared data-object model as used in Orca, a programming language specially designed for loosely coupled computing systems and the Shared Virtual Memory model. For both paradigms the authors have implemented two systems, one using only point-to-point messages, the other using broadcasting as well. They briefly describe these two paradigms and their implementations. Then they compare their performance on four applications: the traveling salesman problem, alpha-beta search, matrix multiplication and the all pairs shortest paths problem. The measurements show that both paradigms can be used efficientlymore » for programming large-grain parallel applications. Significant speedups were obtained on all applications. The unstructured Shared Virtual Memory paradigm achieves the best absolute performance, although this is largely due to the preliminary nature of the Orca compiler used. The structured shared data-object model achieves the highest speedups and is much easier to program and to debug.« less
Development of an HL7 interface engine, based on tree structure and streaming algorithm, for large-size messages which include image data.

PubMed

Um, Ki Sung; Kwak, Yun Sik; Cho, Hune; Kim, Il Kon

2005-11-01

A basic assumption of Health Level Seven (HL7) protocol is 'No limitation of message length'. However, most existing commercial HL7 interface engines do limit message length because they use the string array method, which is run in the main memory for the HL7 message parsing process. Specifically, messages with image and multi-media data create a long string array and thus cause the computer system to raise critical and fatal problem. Consequently, HL7 messages cannot handle the image and multi-media data necessary in modern medical records. This study aims to solve this problem with the 'streaming algorithm' method. This new method for HL7 message parsing applies the character-stream object which process character by character between the main memory and hard disk device with the consequence that the processing load on main memory could be alleviated. The main functions of this new engine are generating, parsing, validating, browsing, sending, and receiving HL7 messages. Also, the engine can parse and generate XML-formatted HL7 messages. This new HL7 engine successfully exchanged HL7 messages with 10 megabyte size images and discharge summary information between two university hospitals.
Keeping the Spirits Up: The Effect of Teachers' and Parents' Emotional Support on Children's Working Memory Performance.

PubMed

Vandenbroucke, Loren; Spilt, Jantine; Verschueren, Karine; Baeyens, Dieter

2017-01-01

Working memory, used to temporarily store and mentally manipulate information, is important for children's learning. It is therefore valuable to understand which (contextual) factors promote or hinder working memory performance. Recent research shows positive associations between positive parent-child and teacher-student interactions and working memory performance and development. However, no study has yet experimentally investigated how parents and teachers affect working memory performance. Based on attachment theory, the current study investigated the role of parent and teacher emotional support in promoting working memory performance by buffering the negative effect of social stress. Questionnaires and an experimental session were completed by 170 children from grade 1 to 2 ( M age = 7 years 6 months, SD = 7 months). Questionnaires were used to assess children's perceptions of the teacher-student and parent-child relationship. During an experimental session, working memory was measured with the Corsi task backward (Milner, 1971) in a pre- and post-test design. In-between the tests stress was induced in the children using the Cyberball paradigm (Williams et al., 2000). Emotional support was manipulated (between-subjects) through an audio message (either a weather report, a supportive message of a stranger, a supportive message of a parent, or a supportive message of a teacher). Results of repeated measures ANOVA showed no clear effect of the stress induction. Nevertheless, an effect of parent and teacher support was found and depended on the quality of the parent-child relationship. When children had a positive relationship with their parent, support of parents and teachers had little effect on working memory performance. When children had a negative relationship with their parent, a supportive message of that parent decreased working memory performance, while a supportive message from the teacher increased performance. In sum, the current study suggests that parents and teachers can support working memory performance by being supportive for the child. Teacher support is most effective when the child has a negative relationship with the parent. These insights can give direction to specific measures aimed at preventing and resolving working memory problems and related issues.
Keeping the Spirits Up: The Effect of Teachers’ and Parents’ Emotional Support on Children’s Working Memory Performance

PubMed Central

Vandenbroucke, Loren; Spilt, Jantine; Verschueren, Karine; Baeyens, Dieter

2017-01-01

Working memory, used to temporarily store and mentally manipulate information, is important for children’s learning. It is therefore valuable to understand which (contextual) factors promote or hinder working memory performance. Recent research shows positive associations between positive parent–child and teacher–student interactions and working memory performance and development. However, no study has yet experimentally investigated how parents and teachers affect working memory performance. Based on attachment theory, the current study investigated the role of parent and teacher emotional support in promoting working memory performance by buffering the negative effect of social stress. Questionnaires and an experimental session were completed by 170 children from grade 1 to 2 (Mage = 7 years 6 months, SD = 7 months). Questionnaires were used to assess children’s perceptions of the teacher–student and parent–child relationship. During an experimental session, working memory was measured with the Corsi task backward (Milner, 1971) in a pre- and post-test design. In-between the tests stress was induced in the children using the Cyberball paradigm (Williams et al., 2000). Emotional support was manipulated (between-subjects) through an audio message (either a weather report, a supportive message of a stranger, a supportive message of a parent, or a supportive message of a teacher). Results of repeated measures ANOVA showed no clear effect of the stress induction. Nevertheless, an effect of parent and teacher support was found and depended on the quality of the parent–child relationship. When children had a positive relationship with their parent, support of parents and teachers had little effect on working memory performance. When children had a negative relationship with their parent, a supportive message of that parent decreased working memory performance, while a supportive message from the teacher increased performance. In sum, the current study suggests that parents and teachers can support working memory performance by being supportive for the child. Teacher support is most effective when the child has a negative relationship with the parent. These insights can give direction to specific measures aimed at preventing and resolving working memory problems and related issues. PMID:28421026
Eigensolver for a Sparse, Large Hermitian Matrix

NASA Technical Reports Server (NTRS)

Tisdale, E. Robert; Oyafuso, Fabiano; Klimeck, Gerhard; Brown, R. Chris

2003-01-01

A parallel-processing computer program finds a few eigenvalues in a sparse Hermitian matrix that contains as many as 100 million diagonal elements. This program finds the eigenvalues faster, using less memory, than do other, comparable eigensolver programs. This program implements a Lanczos algorithm in the American National Standards Institute/ International Organization for Standardization (ANSI/ISO) C computing language, using the Message Passing Interface (MPI) standard to complement an eigensolver in PARPACK. [PARPACK (Parallel Arnoldi Package) is an extension, to parallel-processing computer architectures, of ARPACK (Arnoldi Package), which is a collection of Fortran 77 subroutines that solve large-scale eigenvalue problems.] The eigensolver runs on Beowulf clusters of computers at the Jet Propulsion Laboratory (JPL).
First Applications of the New Parallel Krylov Solver for MODFLOW on a National and Global Scale

NASA Astrophysics Data System (ADS)

Verkaik, J.; Hughes, J. D.; Sutanudjaja, E.; van Walsum, P.

2016-12-01

Integrated high-resolution hydrologic models are increasingly being used for evaluating water management measures at field scale. Their drawbacks are large memory requirements and long run times. Examples of such models are The Netherlands Hydrological Instrument (NHI) model and the PCRaster Global Water Balance (PCR-GLOBWB) model. Typical simulation periods are 30-100 years with daily timesteps. The NHI model predicts water demands in periods of drought, supporting operational and long-term water-supply decisions. The NHI is a state-of-the-art coupling of several models: a 7-layer MODFLOW groundwater model ( 6.5M 250m cells), a MetaSWAP model for the unsaturated zone (Richards emulator of 0.5M cells), and a surface water model (MOZART-DM). The PCR-GLOBWB model provides a grid-based representation of global terrestrial hydrology and this work uses the version that includes a 2-layer MODFLOW groundwater model ( 4.5M 10km cells). The Parallel Krylov Solver (PKS) speeds up computation by both distributed memory parallelization (Message Passing Interface) and shared memory parallelization (Open Multi-Processing). PKS includes conjugate gradient, bi-conjugate gradient stabilized, and generalized minimal residual linear accelerators that use an overlapping additive Schwarz domain decomposition preconditioner. PKS can be used for both structured and unstructured grids and has been fully integrated in MODFLOW-USG using METIS partitioning and in iMODFLOW using RCB partitioning. iMODFLOW is an accelerated version of MODFLOW-2005 that is implicitly and online coupled to MetaSWAP. Results for benchmarks carried out on the Cartesius Dutch supercomputer (https://userinfo.surfsara.nl/systems/cartesius) for the PCRGLOB-WB model and on a 2x16 core Windows machine for the NHI model show speedups up to 10-20 and 5-10, respectively.

I/O Parallelization for the Goddard Earth Observing System Data Assimilation System (GEOS DAS)

NASA Technical Reports Server (NTRS)

Lucchesi, Rob; Sawyer, W.; Takacs, L. L.; Lyster, P.; Zero, J.

1998-01-01

The National Aeronautics and Space Administration (NASA) Data Assimilation Office (DAO) at the Goddard Space Flight Center (GSFC) has developed the GEOS DAS, a data assimilation system that provides production support for NASA missions and will support NASA's Earth Observing System (EOS) in the coming years. The GEOS DAS will be used to provide background fields of meteorological quantities to EOS satellite instrument teams for use in their data algorithms as well as providing assimilated data sets for climate studies on decadal time scales. The DAO has been involved in prototyping parallel implementations of the GEOS DAS for a number of years and is now embarking on an effort to convert the production version from shared-memory parallelism to distributed-memory parallelism using the portable Message-Passing Interface (MPI). The GEOS DAS consists of two main components, an atmospheric General Circulation Model (GCM) and a Physical-space Statistical Analysis System (PSAS). The GCM operates on data that are stored on a regular grid while PSAS works with observational data that are scattered irregularly throughout the atmosphere. As a result, the two components have different data decompositions. The GCM is decomposed horizontally as a checkerboard with all vertical levels of each box existing on the same processing element(PE). The dynamical core of the GCM can also operate on a rotated grid, which requires communication-intensive grid transformations during GCM integration. PSAS groups observations on PEs in a more irregular and dynamic fashion.
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.« less
How communication goals determine when audience tuning biases memory.

PubMed

Echterhoff, Gerald; Higgins, E Tory; Kopietz, René; Groll, Stephan

2008-02-01

After tuning their message to suit their audience's attitude, communicators' own memories for the original information (e.g., a target person's behaviors) often reflect the biased view expressed in their message--producing an audience-congruent memory bias. Exploring the motivational circumstances of message production, the authors investigated whether this bias depends on the goals driving audience tuning. In 4 experiments, the memory bias was found to a greater extent when audience tuning served the creation of a shared reality than when it served alternative, nonshared reality goals (being polite toward a stigmatized-group audience; obtaining incentives; being entertaining; complying with a blatant demand). In addition, the authors found that these effects were mediated by the epistemic trust in the audience-congruent view but not by the rehearsal or accurate retrieval of the original input information, the ability to discriminate between the original and the message information, or a contrast away from extremely tuned messages. The central role of epistemic trust, a measure of the communicators' experience of shared reality, was supported in meta-analyses across the experiments. PsycINFO Database Record (c) 2008 APA, all rights reserved.
Information Management and Learning in Computer Conferences: Coping with Irrelevant and Unconnected Messages.

ERIC Educational Resources Information Center

Schwan, Stephan; Straub, Daniela; Hesse, Friedrich W.

2002-01-01

Describes a study of computer conferencing where learners interacted over the course of four log-in sessions to acquire the knowledge sufficient to pass a learning test. Studied the number of messages irrelevant to the topic, explicit threading of messages, reading times of relevant messages, and learning outcomes. (LRW)
Getting the message across: age differences in the positive and negative framing of health care messages.

PubMed

Shamaskin, Andrea M; Mikels, Joseph A; Reed, Andrew E

2010-09-01

Although valenced health care messages influence impressions, memory, and behavior (Levin, Schneider, & Gaeth, 1998) and the processing of valenced information changes with age (Carstensen & Mikels, 2005), these 2 lines of research have thus far been disconnected. This study examined impressions of, and memory for, positively and negatively framed health care messages that were presented in pamphlets to 25 older adults and 24 younger adults. Older adults relative to younger adults rated positive pamphlets more informative than negative pamphlets and remembered a higher proportion of positive to negative messages. However, older adults misremembered negative messages to be positive. These findings demonstrate the age-related positivity effect in health care messages with promise as to the persuasive nature and lingering effects of positive messages. (c) 2010 APA, all rights reserved.
Virtual memory support for distributed computing environments using a shared data object model

NASA Astrophysics Data System (ADS)

Huang, F.; Bacon, J.; Mapp, G.

1995-12-01

Conventional storage management systems provide one interface for accessing memory segments and another for accessing secondary storage objects. This hinders application programming and affects overall system performance due to mandatory data copying and user/kernel boundary crossings, which in the microkernel case may involve context switches. Memory-mapping techniques may be used to provide programmers with a unified view of the storage system. This paper extends such techniques to support a shared data object model for distributed computing environments in which good support for coherence and synchronization is essential. The approach is based on a microkernel, typed memory objects, and integrated coherence control. A microkernel architecture is used to support multiple coherence protocols and the addition of new protocols. Memory objects are typed and applications can choose the most suitable protocols for different types of object to avoid protocol mismatch. Low-level coherence control is integrated with high-level concurrency control so that the number of messages required to maintain memory coherence is reduced and system-wide synchronization is realized without severely impacting the system performance. These features together contribute a novel approach to the support for flexible coherence under application control.
Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications

PubMed Central

2014-01-01

Background The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it, makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However the memory and the execution time required by the running are of O(n3) and of O(n5) order, respectively, and so, the algorithm is unaffordable for huge data sets. Results We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together to a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing systems MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. Conclusions In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice- junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors. A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one. PMID:25077818
Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications.

PubMed

D'Angelo, Gianni; Rampone, Salvatore

2014-01-01

The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it, makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However the memory and the execution time required by the running are of O(n(3)) and of O(n(5)) order, respectively, and so, the algorithm is unaffordable for huge data sets. We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together to a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing systems MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice- junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors. A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one.
What it Takes to Get Passed On: Message Content, Style, and Structure as Predictors of Retransmission in the Boston Marathon Bombing Response

PubMed Central

Sutton, Jeannette; Gibson, C. Ben; Spiro, Emma S.; League, Cedar; Fitzhugh, Sean M.; Butts, Carter T.

2015-01-01

Message retransmission is a central aspect of information diffusion. In a disaster context, the passing on of official warning messages by members of the public also serves as a behavioral indicator of message salience, suggesting that particular messages are (or are not) perceived by the public to be both noteworthy and valuable enough to share with others. This study provides the first examination of terse message retransmission of official warning messages in response to a domestic terrorist attack, the Boston Marathon Bombing in 2013. Using messages posted from public officials’ Twitter accounts that were active during the period of the Boston Marathon bombing and manhunt, we examine the features of messages that are associated with their retransmission. We focus on message content, style, and structure, as well as the networked relationships of message senders to answer the question: what characteristics of a terse message sent under conditions of imminent threat predict its retransmission among members of the public? We employ a negative binomial model to examine how message characteristics affect message retransmission. We find that, rather than any single effect dominating the process, retransmission of official Tweets during the Boston bombing response was jointly influenced by various message content, style, and sender characteristics. These findings suggest the need for more work that investigates impact of multiple factors on the allocation of attention and on message retransmission during hazard events. PMID:26295584
An alcohol message beneath the surface of ER: how implicit memory influences viewers' health attitudes and intentions using entertainment-education.

PubMed

Kim, Kyongseok; Lee, Mina; Macias, Wendy

2014-01-01

While previous research on entertainment-education has assessed its effectiveness, primarily at the conscious level (e.g., free recall and self-reported change in knowledge), few studies have explored its effect on viewers' implicit knowledge. To fill this gap, this study examined the mechanism through which viewers form implicit memory of short health messages inserted in a primetime TV show and its preconscious effects on viewers' health attitudes and intentions. An experiment was conducted using a 3-group (health message: present vs. absent vs. control), posttest-only design with additional planned analyses of differences by subject variables (past experience and involvement). Overall, findings supported the hypothesized effects of implicit memory of a brief antialcohol message embedded in an ER episode on college students' attitudes and intentions against binge drinking. Results showed that participants who were exposed to the health message reported less positive attitudes toward binge drinking and lower intentions to binge drink, compared with those who were not exposed; the causal relations among viewers' implicit memory, attitudes, and intentions were also validated. Results also showed that individuals' past experience and involvement moderated the effects of the health message on attitudes and intentions. Theoretical explanations and practical implications are discussed.
The Edge-Disjoint Path Problem on Random Graphs by Message-Passing.

PubMed

Altarelli, Fabrizio; Braunstein, Alfredo; Dall'Asta, Luca; De Bacco, Caterina; Franz, Silvio

2015-01-01

We present a message-passing algorithm to solve a series of edge-disjoint path problems on graphs based on the zero-temperature cavity equations. Edge-disjoint paths problems are important in the general context of routing, that can be defined by incorporating under a unique framework both traffic optimization and total path length minimization. The computation of the cavity equations can be performed efficiently by exploiting a mapping of a generalized edge-disjoint path problem on a star graph onto a weighted maximum matching problem. We perform extensive numerical simulations on random graphs of various types to test the performance both in terms of path length minimization and maximization of the number of accommodated paths. In addition, we test the performance on benchmark instances on various graphs by comparison with state-of-the-art algorithms and results found in the literature. Our message-passing algorithm always outperforms the others in terms of the number of accommodated paths when considering non trivial instances (otherwise it gives the same trivial results). Remarkably, the largest improvement in performance with respect to the other methods employed is found in the case of benchmarks with meshes, where the validity hypothesis behind message-passing is expected to worsen. In these cases, even though the exact message-passing equations do not converge, by introducing a reinforcement parameter to force convergence towards a sub optimal solution, we were able to always outperform the other algorithms with a peak of 27% performance improvement in terms of accommodated paths. On random graphs, we numerically observe two separated regimes: one in which all paths can be accommodated and one in which this is not possible. We also investigate the behavior of both the number of paths to be accommodated and their minimum total length.
The Edge-Disjoint Path Problem on Random Graphs by Message-Passing

PubMed Central

2015-01-01

We present a message-passing algorithm to solve a series of edge-disjoint path problems on graphs based on the zero-temperature cavity equations. Edge-disjoint paths problems are important in the general context of routing, that can be defined by incorporating under a unique framework both traffic optimization and total path length minimization. The computation of the cavity equations can be performed efficiently by exploiting a mapping of a generalized edge-disjoint path problem on a star graph onto a weighted maximum matching problem. We perform extensive numerical simulations on random graphs of various types to test the performance both in terms of path length minimization and maximization of the number of accommodated paths. In addition, we test the performance on benchmark instances on various graphs by comparison with state-of-the-art algorithms and results found in the literature. Our message-passing algorithm always outperforms the others in terms of the number of accommodated paths when considering non trivial instances (otherwise it gives the same trivial results). Remarkably, the largest improvement in performance with respect to the other methods employed is found in the case of benchmarks with meshes, where the validity hypothesis behind message-passing is expected to worsen. In these cases, even though the exact message-passing equations do not converge, by introducing a reinforcement parameter to force convergence towards a sub optimal solution, we were able to always outperform the other algorithms with a peak of 27% performance improvement in terms of accommodated paths. On random graphs, we numerically observe two separated regimes: one in which all paths can be accommodated and one in which this is not possible. We also investigate the behavior of both the number of paths to be accommodated and their minimum total length. PMID:26710102
The equipment access software for a distributed UNIX-based accelerator control system

NASA Astrophysics Data System (ADS)

Trofimov, Nikolai; Zelepoukine, Serguei; Zharkov, Eugeny; Charrue, Pierre; Gareyte, Claire; Poirier, Hervé

1994-12-01

This paper presents a generic equipment access software package for a distributed control system using computers with UNIX or UNIX-like operating systems. The package consists of three main components, an application Equipment Access Library, Message Handler and Equipment Data Base. An application task, which may run in any computer in the network, sends requests to access equipment through Equipment Library calls. The basic request is in the form Equipment-Action-Data and is routed via a remote procedure call to the computer to which the given equipment is connected. In this computer the request is received by the Message Handler. According to the type of the equipment connection, the Message Handler either passes the request to the specific process software in the same computer or forwards it to a lower level network of equipment controllers using MIL1553B, GPIB, RS232 or BITBUS communication. The answer is then returned to the calling application. Descriptive information required for request routing and processing is stored in the real-time Equipment Data Base. The package has been written to be portable and is currently available on DEC Ultrix, LynxOS, HPUX, XENIX, OS-9 and Apollo domain.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jin, Shuangshuang; Chen, Yousu; Wu, Di

2015-12-09

Power system dynamic simulation computes the system response to a sequence of large disturbance, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computational intensive and challenging to solve using single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on shared-memory platform, and Messagemore » Passing Interface (MPI) on distributed-memory clusters, respectively. The difference of the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.« less
Support for non-locking parallel reception of packets belonging to a single memory reception FIFO

DOEpatents

Chen, Dong [Yorktown Heights, NY; Heidelberger, Philip [Yorktown Heights, NY; Salapura, Valentina [Yorktown Heights, NY; Senger, Robert M [Yorktown Heights, NY; Steinmacher-Burow, Burkhard [Boeblingen, DE; Sugawara, Yutaka [Yorktown Heights, NY

2011-01-27

A method and apparatus for distributed parallel messaging in a parallel computing system. A plurality of DMA engine units are configured in a multiprocessor system to operate in parallel, one DMA engine unit for transferring a current packet received at a network reception queue to a memory location in a memory FIFO (rmFIFO) region of a memory. A control unit implements logic to determine whether any prior received packet destined for that rmFIFO is still in a process of being stored in the associated memory by another DMA engine unit of the plurality, and prevent the one DMA engine unit from indicating completion of storing the current received packet in the reception memory FIFO (rmFIFO) until all prior received packets destined for that rmFIFO are completely stored by the other DMA engine units. Thus, there is provided non-locking support so that multiple packets destined for a single rmFIFO are transferred and stored in parallel to predetermined locations in a memory.
Countering Craving with Disgust Images: Examining Nicotine Withdrawn Smokers' Motivated Message Processing of Anti-Tobacco Public Service Announcements.

PubMed

Clayton, Russell B; Leshner, Glenn; Tomko, Rachel L; Trull, Timothy J; Piasecki, Thomas M

2017-03-01

There is a lack of research examining whether smoking cues in anti-tobacco advertisements elicit cravings, or whether this effect is moderated by countervailing message attributes, such as disgusting images. Furthermore, no research has examined how these types of messages influence nicotine withdrawn smokers' cognitive processing and associated behavioral intentions. At a laboratory session, participants (N = 50 nicotine-deprived adults) were tested for cognitive processing and recognition memory of 12 anti-tobacco advertisements varying in depictions of smoking cues and disgust content. Self-report smoking urges and intentions to quit smoking were measured after each message. The results from this experiment indicated that smoking cue messages activated appetitive/approach motivation resulting in enhanced attention and memory, but increased craving and reduced quit intentions. Disgust messages also enhanced attention and memory, but activated aversive/avoid motivation resulting in reduced craving and increased quit intentions. The combination of smoking cues and disgust content resulted in moderate amounts of craving and quit intentions, but also led to heart rate acceleration (indicating defensive processing) and poorer recognition of message content. These data suggest that in order to counter nicotine-deprived smokers' craving and prolong abstinence, anti-tobacco messages should omit smoking cues but include disgust. Theoretical implications are also discussed.
Countering Craving with Disgust Images: Examining Nicotine Withdrawn Smokers’ Motivated Message Processing of Anti-Tobacco Public Service Announcements

PubMed Central

CLAYTON, RUSSELL B.; LESHNER, GLENN; TOMKO, RACHEL L.; TRULL, TIMOTHY J.; PIASECKI, THOMAS M.

2017-01-01

There is a lack of research examining whether smoking cues in anti-tobacco advertisements elicit cravings, or whether this effect is moderated by countervailing message attributes, such as disgusting images. Furthermore, no research has examined how these types of messages influence nicotine withdrawn smokers’ cognitive processing and associated behavioral intentions. At a laboratory session, participants (N = 50 nicotine-deprived adults) were tested for cognitive processing and recognition memory of 12 anti-tobacco advertisements varying in depictions of smoking cues and disgust content. Self-report smoking urges and intentions to quit smoking were measured after each message. The results from this experiment indicated that smoking cue messages activated appetitive/approach motivation resulting in enhanced attention and memory, but increased craving and reduced quit intentions. Disgust messages also enhanced attention and memory, but activated aversive/avoid motivation resulting in reduced craving and increased quit intentions. The combination of smoking cues and disgust content resulted in moderate amounts of craving and quit intentions, but also led to heart rate acceleration (indicating defensive processing) and poorer recognition of message content. These data suggest that in order to counter nicotine-deprived smokers’ craving and prolong abstinence, anti-tobacco messages should omit smoking cues but include disgust. Theoretical implications are also discussed. PMID:28248620
Positive messages enhance older adults' motivation and recognition memory for physical activity programmes.

PubMed

Notthoff, Nanna; Klomp, Peter; Doerwald, Friederike; Scheibe, Susanne

2016-09-01

Although physical activity is an effective way to cope with ageing-related impairments, few older people are motivated to turn their sedentary lifestyle into an active one. Recent evidence suggests that walking can be more effectively promoted in older adults with positive messages about the benefits of walking than with negative messages about the risks of inactivity. This study examined motivation and memory as the supposed mechanisms underlying the greater effectiveness of positively framed compared to negatively framed messages for promoting activity. Older adults ( N = 53, age 60-87 years) were introduced to six physical activity programmes that were randomly paired with either positively framed or negatively framed messages. Participants indicated how motivated they were to participate in each programme by providing ratings on attractiveness, suitability, capability and intention. They also completed surprise free recall and recognition tests. Respondents felt more motivated to participate in physical activity programmes paired with positively framed messages than in those with negatively framed ones. They also had better recognition memory for positively framed than negatively framed messages, and misremembered negatively framed messages to be positively framed. Findings support the notion that socioemotional selectivity theory-a theory of age-related changes in motivation-is a useful basis for health intervention design.
Message passing with a limited number of DMA byte counters

DOEpatents

Blocksome, Michael [Rochester, MN; Chen, Dong [Croton on Hudson, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kumar, Sameer [White Plains, NY; Parker, Jeffrey J [Rochester, MN

2011-10-04

A method for passing messages in a parallel computer system constructed as a plurality of compute nodes interconnected as a network where each compute node includes a DMA engine but includes only a limited number of byte counters for tracking a number of bytes that are sent or received by the DMA engine, where the byte counters may be used in shared counter or exclusive counter modes of operation. The method includes using rendezvous protocol, a source compute node deterministically sending a request to send (RTS) message with a single RTS descriptor using an exclusive injection counter to track both the RTS message and message data to be sent in association with the RTS message, to a destination compute node such that the RTS descriptor indicates to the destination compute node that the message data will be adaptively routed to the destination node. Using one DMA FIFO at the source compute node, the RTS descriptors are maintained for rendezvous messages destined for the destination compute node to ensure proper message data ordering thereat. Using a reception counter at a DMA engine, the destination compute node tracks reception of the RTS and associated message data and sends a clear to send (CTS) message to the source node in a rendezvous protocol form of a remote get to accept the RTS message and message data and processing the remote get (CTS) by the source compute node DMA engine to provide the message data to be sent.
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

2014-09-02

Eager send data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints that specify a client, a context, and a task, including receiving an eager send data communications instruction with transfer data disposed in a send buffer characterized by a read/write send buffer memory address in a read/write virtual address space of the origin endpoint; determining for the send buffer a read-only send buffer memory address in a read-only virtual address space, the read-only virtual address space shared by both the origin endpoint and the target endpoint, with all frames of physical memory mapped to pages of virtual memory in the read-only virtual address space; and communicating by the origin endpoint to the target endpoint an eager send message header that includes the read-only send buffer memory address.

Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

2014-09-16

Eager send data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints that specify a client, a context, and a task, including receiving an eager send data communications instruction with transfer data disposed in a send buffer characterized by a read/write send buffer memory address in a read/write virtual address space of the origin endpoint; determining for the send buffer a read-only send buffer memory address in a read-only virtual address space, the read-only virtual address space shared by both the origin endpoint and the target endpoint, with all frames of physical memory mapped to pages of virtual memory in the read-only virtual address space; and communicating by the origin endpoint to the target endpoint an eager send message header that includes the read-only send buffer memory address.
Ensuring correct rollback recovery in distributed shared memory systems

NASA Technical Reports Server (NTRS)

Janssens, Bob; Fuchs, W. Kent

1995-01-01

Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing method for DSM that takes advantage of DSM's specific properties to reduce error-free and rollback overhead. The scheme reduces the dependencies that need to be considered for correct rollback to those resulting from transfers of pages. Furthermore, in-transit messages can be recovered without the use of logging. We extend the scheme to a DSM implementation using lazy release consistency, where the frequency of dependencies is further reduced.
Public-channel cryptography based on mutual chaos pass filters.

PubMed

Klein, Einat; Gross, Noam; Kopelowitz, Evi; Rosenbluh, Michael; Khaykovich, Lev; Kinzel, Wolfgang; Kanter, Ido

2006-10-01

We study the mutual coupling of chaotic lasers and observe both experimentally and in numeric simulations that there exists a regime of parameters for which two mutually coupled chaotic lasers establish isochronal synchronization, while a third laser coupled unidirectionally to one of the pair does not synchronize. We then propose a cryptographic scheme, based on the advantage of mutual coupling over unidirectional coupling, where all the parameters of the system are public knowledge. We numerically demonstrate that in such a scheme the two communicating lasers can add a message signal (compressed binary message) to the transmitted coupling signal and recover the message in both directions with high fidelity by using a mutual chaos pass filter procedure. An attacker, however, fails to recover an errorless message even if he amplifies the coupling signal.
Administering an epoch initiated for remote memory access

DOEpatents

Blocksome, Michael A; Miller, Douglas R

2014-03-18

Methods, systems, and products are disclosed for administering an epoch initiated for remote memory access that include: initiating, by an origin application messaging module on an origin compute node, one or more data transfers to a target compute node for the epoch; initiating, by the origin application messaging module after initiating the data transfers, a closing stage for the epoch, including rejecting any new data transfers after initiating the closing stage for the epoch; determining, by the origin application messaging module, whether the data transfers have completed; and closing, by the origin application messaging module, the epoch if the data transfers have completed.
Administering an epoch initiated for remote memory access

DOEpatents

Blocksome, Michael A; Miller, Douglas R

2012-10-23

Methods, systems, and products are disclosed for administering an epoch initiated for remote memory access that include: initiating, by an origin application messaging module on an origin compute node, one or more data transfers to a target compute node for the epoch; initiating, by the origin application messaging module after initiating the data transfers, a closing stage for the epoch, including rejecting any new data transfers after initiating the closing stage for the epoch; determining, by the origin application messaging module, whether the data transfers have completed; and closing, by the origin application messaging module, the epoch if the data transfers have completed.
Administering an epoch initiated for remote memory access

DOEpatents

Blocksome, Michael A.; Miller, Douglas R.

2013-01-01

Methods, systems, and products are disclosed for administering an epoch initiated for remote memory access that include: initiating, by an origin application messaging module on an origin compute node, one or more data transfers to a target compute node for the epoch; initiating, by the origin application messaging module after initiating the data transfers, a closing stage for the epoch, including rejecting any new data transfers after initiating the closing stage for the epoch; determining, by the origin application messaging module, whether the data transfers have completed; and closing, by the origin application messaging module, the epoch if the data transfers have completed.
Django Remote Submission

DOE Office of Scientific and Technical Information (OSTI.GOV)

Doucet, Mathieu; Hobson, Tanner C.; Ferraz Leal, Ricardo Miguel

The Django Remote Submission (DRS) is a Django (Django, n.d.) application to manage long running job submission, including starting the job, saving logs, and storing results. It is an independent project available as a standalone pypi package (PyPi, n.d.). It can be easily integrated in any Django project. The source code is freely available as a GitHub repository (django-remote-submission, n.d.). To run the jobs in background, DRS takes advantage of Celery (Celery, n.d.), a powerful asynchronous job queue used for running tasks in the background, and the Redis Server (Redis, n.d.), an in-memory data structure store. Celery uses brokers tomore » pass messages between a Django Project and the Celery workers. Redis is the message broker of DRS. In addition DRS provides real time monitoring of the progress of Jobs and associated logs. Through the Django Channels project (Channels, n.d.), and the usage of Web Sockets, it is possible to asynchronously display the Job Status and the live Job output (standard output and standard error) on a web page.« less
Django Remote Submission

DOE PAGES

Doucet, Mathieu; Hobson, Tanner C.; Ferraz Leal, Ricardo Miguel

2017-08-01

The Django Remote Submission (DRS) is a Django (Django, n.d.) application to manage long running job submission, including starting the job, saving logs, and storing results. It is an independent project available as a standalone pypi package (PyPi, n.d.). It can be easily integrated in any Django project. The source code is freely available as a GitHub repository (django-remote-submission, n.d.). To run the jobs in background, DRS takes advantage of Celery (Celery, n.d.), a powerful asynchronous job queue used for running tasks in the background, and the Redis Server (Redis, n.d.), an in-memory data structure store. Celery uses brokers tomore » pass messages between a Django Project and the Celery workers. Redis is the message broker of DRS. In addition DRS provides real time monitoring of the progress of Jobs and associated logs. Through the Django Channels project (Channels, n.d.), and the usage of Web Sockets, it is possible to asynchronously display the Job Status and the live Job output (standard output and standard error) on a web page.« less
Satellite telemetry: performance of animal-tracking systems

USGS Publications Warehouse

Keating, Kim A.; Brewster, Wayne G.; Key, Carl H.

1991-01-01

t: We used 10 Telonics ST-3 platform transmitter terminals (PTT's) configured for wolves and ungulates to examine the performance of the Argos satellite telemetry system. Under near-optimal conditions, 68 percentile errors for location qualities (NQ) 1, 2, and 3 were 1,188, 903, and 361 m, respectively. Errors (rE) exceeded expected values for NQ = 2 and 3, varied greatly among PTT's, increased as the difference (HE) between the estimated and actual PTT elevations increased, and were correlated nonlinearly with maximum satellite pass height (P,). We present a model of the relationships among rE, HE, and PH. Errors were bimodally distributed along the east-west axis and tended to occur away from the satellite when HE was positive. A southeasterly bias increased with HE, probably due to the particular distribution of satellite passes and effects of HE on rE. Under near-optimal conditions, 21 sensor message was received for up to 64% of available (PH, 50) satellite passes, and a location (NQ 2 1) was calculated for up to 63% of such passes. Sampling frequencies of sensor and location data declined 13 and 70%, respectively, for PTT's in a valley bottom and 65 and 86%, respectively, for PTT's on animals that were in valley bottoms. Sampling frequencies were greater for ungulate than for wolf collars.
An implementation and performance measurement of the progressive retry technique

NASA Technical Reports Server (NTRS)

Suri, Gaurav; Huang, Yennun; Wang, Yi-Min; Fuchs, W. Kent; Kintala, Chandra

1995-01-01

This paper describes a recovery technique called progressive retry for bypassing software faults in message-passing applications. The technique is implemented as reusable modules to provide application-level software fault tolerance. The paper describes the implementation of the technique and presents results from the application of progressive retry to two telecommunications systems. the results presented show that the technique is helpful in reducing the total recovery time for message-passing applications.
Belief propagation decoding of quantum channels by passing quantum messages

NASA Astrophysics Data System (ADS)

Renes, Joseph M.

2017-07-01

The belief propagation (BP) algorithm is a powerful tool in a wide range of disciplines from statistical physics to machine learning to computational biology, and is ubiquitous in decoding classical error-correcting codes. The algorithm works by passing messages between nodes of the factor graph associated with the code and enables efficient decoding of the channel, in some cases even up to the Shannon capacity. Here we construct the first BP algorithm which passes quantum messages on the factor graph and is capable of decoding the classical-quantum channel with pure state outputs. This gives explicit decoding circuits whose number of gates is quadratic in the code length. We also show that this decoder can be modified to work with polar codes for the pure state channel and as part of a decoder for transmitting quantum information over the amplitude damping channel. These represent the first explicit capacity-achieving decoders for non-Pauli channels.
LCAMP: Location Constrained Approximate Message Passing for Compressed Sensing MRI

PubMed Central

Sung, Kyunghyun; Daniel, Bruce L; Hargreaves, Brian A

2016-01-01

Iterative thresholding methods have been extensively studied as faster alternatives to convex optimization methods for solving large-sized problems in compressed sensing. A novel iterative thresholding method called LCAMP (Location Constrained Approximate Message Passing) is presented for reducing computational complexity and improving reconstruction accuracy when a nonzero location (or sparse support) constraint can be obtained from view shared images. LCAMP modifies the existing approximate message passing algorithm by replacing the thresholding stage with a location constraint, which avoids adjusting regularization parameters or thresholding levels. This work is first compared with other conventional reconstruction methods using random 1D signals and then applied to dynamic contrast-enhanced breast MRI to demonstrate the excellent reconstruction accuracy (less than 2% absolute difference) and low computation time (5 - 10 seconds using Matlab) with highly undersampled 3D data (244 × 128 × 48; overall reduction factor = 10). PMID:23042658
Efficient Tracing for On-the-Fly Space-Time Displays in a Debugger for Message Passing Programs

NASA Technical Reports Server (NTRS)

Hood, Robert; Matthews, Gregory

2001-01-01

In this work we describe the implementation of a practical mechanism for collecting and displaying trace information in a debugger for message passing programs. We introduce a trace format that is highly compressible while still providing information adequate for debugging purposes. We make the mechanism convenient for users to access by incorporating the trace collection in a set of wrappers for the MPI (message passing interface) communication library. We implement several debugger operations that use the trace display: consistent stoplines, undo, and rollback. They all are implemented using controlled replay, which executes at full speed in target processes until the appropriate position in the computation is reached. They provide convenient mechanisms for getting to places in the execution where the full power of a state-based debugger can be brought to bear on isolating communication errors.
Deterministic secure quantum communication using a single d-level system.

PubMed

Jiang, Dong; Chen, Yuanyuan; Gu, Xuemei; Xie, Ling; Chen, Lijun

2017-03-22

Deterministic secure quantum communication (DSQC) can transmit secret messages between two parties without first generating a shared secret key. Compared with quantum key distribution (QKD), DSQC avoids the waste of qubits arising from basis reconciliation and thus reaches higher efficiency. In this paper, based on data block transmission and order rearrangement technologies, we propose a DSQC protocol. It utilizes a set of single d-level systems as message carriers, which are used to directly encode the secret message in one communication process. Theoretical analysis shows that these employed technologies guarantee the security, and the use of a higher dimensional quantum system makes our protocol achieve higher security and efficiency. Since only quantum memory is required for implementation, our protocol is feasible with current technologies. Furthermore, Trojan horse attack (THA) is taken into account in our protocol. We give a THA model and show that THA significantly increases the multi-photon rate and can thus be detected.
TiD-Introducing and Benchmarking an Event-Delivery System for Brain-Computer Interfaces.

PubMed

Breitwieser, Christian; Tavella, Michele; Schreuder, Martijn; Cincotti, Febo; Leeb, Robert; Muller-Putz, Gernot R

2017-12-01

In this paper, we present and analyze an event distribution system for brain-computer interfaces. Events are commonly used to mark and describe incidents during an experiment and are therefore critical for later data analysis or immediate real-time processing. The presented approach, called Tools for brain-computer interaction interface D (TiD), delivers messages in XML format via a buslike system using transmission control protocol connections or shared memory. A dedicated server dispatches TiD messages to distributed or local clients. The TiD message is designed to be flexible and contains time stamps for event synchronization, whereas events describe incidents, which occur during an experiment. TiD was tested extensively toward stability and latency. The effect of an occurring event jitter was analyzed and benchmarked on a reference implementation under different conditions as gigabit and 100-Mb Ethernet or Wi-Fi with a different number of event receivers. A 3-dB signal attenuation, which occurs when averaging jitter influenced trials aligned by events, is starting to become visible at around 1-2 kHz in the case of a gigabit connection. Mean event distribution times across operating systems are ranging from 0.3 to 0.5ms for a gigabit network connection for 10 6 events. Results for other environmental conditions are available in this paper. References already using TiD for event distribution are provided showing the applicability of TiD for event delivery with distributed or local clients.
Shared reality in intergroup communication: Increasing the epistemic authority of an out-group audience.

PubMed

Echterhoff, Gerald; Kopietz, René; Higgins, E Tory

2017-06-01

Communicators typically tune messages to their audience's attitude. Such audience tuning biases communicators' memory for the topic toward the audience's attitude to the extent that they create a shared reality with the audience. To investigate shared reality in intergroup communication, we first established that a reduced memory bias after tuning messages to an out-group (vs. in-group) audience is a subtle index of communicators' denial of shared reality to that out-group audience (Experiments 1a and 1b). We then examined whether the audience-tuning memory bias might emerge when the out-group audience's epistemic authority is enhanced, either by increasing epistemic expertise concerning the communication topic or by creating epistemic consensus among members of a multiperson out-group audience. In Experiment 2, when Germans communicated to a Turkish audience with an attitude about a Turkish (vs. German) target, the audience-tuning memory bias appeared. In Experiment 3, when the audience of German communicators consisted of 3 Turks who all held the same attitude toward the target, the memory bias again appeared. The association between message valence and memory valence was consistently higher when the audience's epistemic authority was high (vs. low). An integrative analysis across all studies also suggested that the memory bias increases with increasing strength of epistemic inputs (epistemic expertise, epistemic consensus, and audience-tuned message production). The findings suggest novel ways of overcoming intergroup biases in intergroup relations. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines

PubMed Central

2011-01-01

Background Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts. Results To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, all flowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats). Conclusions PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples. PMID:21352538
A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines.

PubMed

Cieślik, Marcin; Mura, Cameron

2011-02-25

Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts. To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, all flowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats). PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
The Cortex Transform as an image preprocessor for sparse distributed memory: An initial study

NASA Technical Reports Server (NTRS)

Olshausen, Bruno; Watson, Andrew

1990-01-01

An experiment is described which was designed to evaluate the use of the Cortex Transform as an image processor for Sparse Distributed Memory (SDM). In the experiment, a set of images were injected with Gaussian noise, preprocessed with the Cortex Transform, and then encoded into bit patterns. The various spatial frequency bands of the Cortex Transform were encoded separately so that they could be evaluated based on their ability to properly cluster patterns belonging to the same class. The results of this study indicate that by simply encoding the low pass band of the Cortex Transform, a very suitable input representation for the SDM can be achieved.
Performance of hybrid programming models for multiscale cardiac simulations: preparing for petascale computation.

PubMed

Pope, Bernard J; Fitch, Blake G; Pitman, Michael C; Rice, John J; Reumann, Matthias

2011-10-01

Future multiscale and multiphysics models that support research into human disease, translational medical science, and treatment can utilize the power of high-performance computing (HPC) systems. We anticipate that computationally efficient multiscale models will require the use of sophisticated hybrid programming models, mixing distributed message-passing processes [e.g., the message-passing interface (MPI)] with multithreading (e.g., OpenMP, Pthreads). The objective of this study is to compare the performance of such hybrid programming models when applied to the simulation of a realistic physiological multiscale model of the heart. Our results show that the hybrid models perform favorably when compared to an implementation using only the MPI and, furthermore, that OpenMP in combination with the MPI provides a satisfactory compromise between performance and code complexity. Having the ability to use threads within MPI processes enables the sophisticated use of all processor cores for both computation and communication phases. Considering that HPC systems in 2012 will have two orders of magnitude more cores than what was used in this study, we believe that faster than real-time multiscale cardiac simulations can be achieved on these systems.

Petascale computation performance of lightweight multiscale cardiac models using hybrid programming models.

PubMed

Pope, Bernard J; Fitch, Blake G; Pitman, Michael C; Rice, John J; Reumann, Matthias

2011-01-01

Future multiscale and multiphysics models must use the power of high performance computing (HPC) systems to enable research into human disease, translational medical science, and treatment. Previously we showed that computationally efficient multiscale models will require the use of sophisticated hybrid programming models, mixing distributed message passing processes (e.g. the message passing interface (MPI)) with multithreading (e.g. OpenMP, POSIX pthreads). The objective of this work is to compare the performance of such hybrid programming models when applied to the simulation of a lightweight multiscale cardiac model. Our results show that the hybrid models do not perform favourably when compared to an implementation using only MPI which is in contrast to our results using complex physiological models. Thus, with regards to lightweight multiscale cardiac models, the user may not need to increase programming complexity by using a hybrid programming approach. However, considering that model complexity will increase as well as the HPC system size in both node count and number of cores per node, it is still foreseeable that we will achieve faster than real time multiscale cardiac simulations on these systems using hybrid programming models.
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

PubMed Central

Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L.; Grubmüller, Helmut

2015-01-01

The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well‐exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)‐based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off‐loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance‐to‐price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer‐class GPUs this improvement equally reflects in the performance‐to‐price ratio. Although memory issues in consumer‐class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost‐efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well‐balanced ratio of CPU and consumer‐class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc. PMID:26238484
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations.

PubMed

Kutzner, Carsten; Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L; Grubmüller, Helmut

2015-10-05

The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well-exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.
The graphical brain: Belief propagation and active inference

PubMed Central

Friston, Karl J.; Parr, Thomas; de Vries, Bert

2018-01-01

This paper considers functional integration in the brain from a computational perspective. We ask what sort of neuronal message passing is mandated by active inference—and what implications this has for context-sensitive connectivity at microscopic and macroscopic levels. In particular, we formulate neuronal processing as belief propagation under deep generative models. Crucially, these models can entertain both discrete and continuous states, leading to distinct schemes for belief updating that play out on the same (neuronal) architecture. Technically, we use Forney (normal) factor graphs to elucidate the requisite message passing in terms of its form and scheduling. To accommodate mixed generative models (of discrete and continuous states), one also has to consider link nodes or factors that enable discrete and continuous representations to talk to each other. When mapping the implicit computational architecture onto neuronal connectivity, several interesting features emerge. For example, Bayesian model averaging and comparison, which link discrete and continuous states, may be implemented in thalamocortical loops. These and other considerations speak to a computational connectome that is inherently state dependent and self-organizing in ways that yield to a principled (variational) account. We conclude with simulations of reading that illustrate the implicit neuronal message passing, with a special focus on how discrete (semantic) representations inform, and are informed by, continuous (visual) sampling of the sensorium. Author Summary This paper considers functional integration in the brain from a computational perspective. We ask what sort of neuronal message passing is mandated by active inference—and what implications this has for context-sensitive connectivity at microscopic and macroscopic levels. In particular, we formulate neuronal processing as belief propagation under deep generative models that can entertain both discrete and continuous states. This leads to distinct schemes for belief updating that play out on the same (neuronal) architecture. Technically, we use Forney (normal) factor graphs to characterize the requisite message passing, and link this formal characterization to canonical microcircuits and extrinsic connectivity in the brain. PMID:29417960
Distributed memory approaches for robotic neural controllers

NASA Technical Reports Server (NTRS)

Jorgensen, Charles C.

1990-01-01

The suitability is explored of two varieties of distributed memory neutral networks as trainable controllers for a simulated robotics task. The task requires that two cameras observe an arbitrary target point in space. Coordinates of the target on the camera image planes are passed to a neural controller which must learn to solve the inverse kinematics of a manipulator with one revolute and two prismatic joints. Two new network designs are evaluated. The first, radial basis sparse distributed memory (RBSDM), approximates functional mappings as sums of multivariate gaussians centered around previously learned patterns. The second network types involved variations of Adaptive Vector Quantizers or Self Organizing Maps. In these networks, random N dimensional points are given local connectivities. They are then exposed to training patterns and readjust their locations based on a nearest neighbor rule. Both approaches are tested based on their ability to interpolate manipulator joint coordinates for simulated arm movement while simultaneously performing stereo fusion of the camera data. Comparisons are made with classical k-nearest neighbor pattern recognition techniques.
Design and Implementation of an Operations Module for the ARGOS paperless Ship System

DTIC Science & Technology

1989-06-01

A. OPERATIONS STACK SCRIPTS SCRIPTS FOR STACK: operations * BACKGROUND #1: Operations * on openStack hide message box show menuBar pass openStack end... openStack ** CARD #1, BUTTON #1: Up ***** on mouseUp visual effect zoom out go to card id 10931 of stack argos end mouseUp ** CARD #1, BUTTON #2...STACK SCRIPTS SCRIPTS FOR STACK: Reports ** BACKGROUND #1: Operations * on openStack hie message box show menuBar pass openStack end openStack ** CARD #1
Barista: A Framework for Concurrent Speech Processing by USC-SAIL

PubMed Central

Can, Doğan; Gibson, James; Vaz, Colin; Georgiou, Panayiotis G.; Narayanan, Shrikanth S.

2016-01-01

We present Barista, an open-source framework for concurrent speech processing based on the Kaldi speech recognition toolkit and the libcppa actor library. With Barista, we aim to provide an easy-to-use, extensible framework for constructing highly customizable concurrent (and/or distributed) networks for a variety of speech processing tasks. Each Barista network specifies a flow of data between simple actors, concurrent entities communicating by message passing, modeled after Kaldi tools. Leveraging the fast and reliable concurrency and distribution mechanisms provided by libcppa, Barista lets demanding speech processing tasks, such as real-time speech recognizers and complex training workflows, to be scheduled and executed on parallel (and/or distributed) hardware. Barista is released under the Apache License v2.0. PMID:27610047
Barista: A Framework for Concurrent Speech Processing by USC-SAIL.

PubMed

Can, Doğan; Gibson, James; Vaz, Colin; Georgiou, Panayiotis G; Narayanan, Shrikanth S

2014-05-01

We present Barista, an open-source framework for concurrent speech processing based on the Kaldi speech recognition toolkit and the libcppa actor library. With Barista, we aim to provide an easy-to-use, extensible framework for constructing highly customizable concurrent (and/or distributed) networks for a variety of speech processing tasks. Each Barista network specifies a flow of data between simple actors, concurrent entities communicating by message passing, modeled after Kaldi tools. Leveraging the fast and reliable concurrency and distribution mechanisms provided by libcppa, Barista lets demanding speech processing tasks, such as real-time speech recognizers and complex training workflows, to be scheduled and executed on parallel (and/or distributed) hardware. Barista is released under the Apache License v2.0.
Suboptimal Exposure to Facial Expressions When Viewing Video Messages From a Small Screen: Effects on Emotion, Attention, and Memory

ERIC Educational Resources Information Center

Ravaja, Niklas; Kallinen, Kari; Saari, Timo; Keltikangas-Jarvinen, Liisa

2004-01-01

The authors examined the effects of suboptimally presented facial expressions on emotional and attentional responses and memory among 39 young adults viewing video (business news) messages from a small screen. Facial electromyography (EMG) and respiratory sinus arrhythmia were used as physiological measures of emotion and attention, respectively.…
The Effects of Involvement, Message Appeal, and Viewing Conditions on Memory and Evaluation of TV Commercials.

ERIC Educational Resources Information Center

Nowak, Glen; Thorson, Esther

A study tested an information processing model that incorporates the concepts of episodic and semantic memory. The model was designed to provide for the concurrent study of three advertising and communication variables: product involvement, message appeal, and distraction in viewing conditions. Among the five hypotheses being tested were that…
The influence of ATC message length and timing on pilot communication

NASA Technical Reports Server (NTRS)

Morrow, Daniel; Rodvold, Michelle

1993-01-01

Pilot-controller communication is critical to safe and efficient flight. It is often a challenging component of piloting, which is reflected in the number of incidents and accidents involving miscommunication. Our previous field study identified communication problems that disrupt routine communication between pilots and controllers. The present part-task simulation study followed up the field results with a more controlled investigation of communication problems. Pilots flew a simulation in which they were frequently vectored by Air Traffic Control (ATC), requiring intensive communication with the controller. While flying, pilots also performed a secondary visual monitoring task. We examined the influence of message length (one message with four commands vs. two messages with two commands each) and noncommunication workload on communication accuracy and length. Longer ATC messages appeared to overload pilot working memory, resulting in more incorrect or partial readbacks, as well as more requests to repeat the message. The timing between the two short messages also influenced communication. The second message interfered with memory for or response to the first short message when it was delivered too soon after the first message. Performing the secondary monitoring task did not influence communication. Instead, communication reduced monitoring accuracy.
Couriers in the Inca Empire: Getting Your Message Across. [Lesson Plan].

ERIC Educational Resources Information Center

2002

This lesson shows how the Inca communicated across the vast stretches of their mountain realm, the largest empire of the pre-industrial world. The lesson explains how couriers carried messages along mountain-ridge roads, up and down stone steps, and over chasm-spanning footbridges. It states that couriers could pass a message from Quito (Ecuador)…
An Application-Based Performance Characterization of the Columbia Supercluster

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Djomehri, Jahed M.; Hood, Robert; Jin, Hoaqiang; Kiris, Cetin; Saini, Subhash

2005-01-01

Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each, and currently ranked as the second-fastest computer in the world. In this paper, we present the performance characteristics of Columbia obtained on up to four computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication fabrics. We evaluate floating-point performance, memory bandwidth, message passing communication speeds, and compilers using a subset of the HPC Challenge benchmarks, and some of the NAS Parallel Benchmarks including the multi-zone versions. We present detailed performance results for three scientific applications of interest to NASA, one from molecular dynamics, and two from computational fluid dynamics. Our results show that both the NUMAlink4 and the InfiniBand hold promise for application scaling to a large number of processors.
Hierarchical parallelisation of functional renormalisation group calculations - hp-fRG

NASA Astrophysics Data System (ADS)

Rohe, Daniel

2016-10-01

The functional renormalisation group (fRG) has evolved into a versatile tool in condensed matter theory for studying important aspects of correlated electron systems. Practical applications of the method often involve a high numerical effort, motivating the question in how far High Performance Computing (HPC) can leverage the approach. In this work we report on a multi-level parallelisation of the underlying computational machinery and show that this can speed up the code by several orders of magnitude. This in turn can extend the applicability of the method to otherwise inaccessible cases. We exploit three levels of parallelisation: Distributed computing by means of Message Passing (MPI), shared-memory computing using OpenMP, and vectorisation by means of SIMD units (single-instruction-multiple-data). Results are provided for two distinct High Performance Computing (HPC) platforms, namely the IBM-based BlueGene/Q system JUQUEEN and an Intel Sandy-Bridge-based development cluster. We discuss how certain issues and obstacles were overcome in the course of adapting the code. Most importantly, we conclude that this vast improvement can actually be accomplished by introducing only moderate changes to the code, such that this strategy may serve as a guideline for other researcher to likewise improve the efficiency of their codes.
A Comparison of PETSC Library and HPF Implementations of an Archetypal PDE Computation

NASA Technical Reports Server (NTRS)

Hayder, M. Ehtesham; Keyes, David E.; Mehrotra, Piyush

1997-01-01

Two paradigms for distributed-memory parallel computation that free the application programmer from the details of message passing are compared for an archetypal structured scientific computation a nonlinear, structured-grid partial differential equation boundary value problem using the same algorithm on the same hardware. Both paradigms, parallel libraries represented by Argonne's PETSC, and parallel languages represented by the Portland Group's HPF, are found to be easy to use for this problem class, and both are reasonably effective in exploiting concurrency after a short learning curve. The level of involvement required by the application programmer under either paradigm includes specification of the data partitioning (corresponding to a geometrically simple decomposition of the domain of the PDE). Programming in SPAM style for the PETSC library requires writing the routines that discretize the PDE and its Jacobian, managing subdomain-to-processor mappings (affine global- to-local index mappings), and interfacing to library solver routines. Programming for HPF requires a complete sequential implementation of the same algorithm, introducing concurrency through subdomain blocking (an effort similar to the index mapping), and modest experimentation with rewriting loops to elucidate to the compiler the latent concurrency. Correctness and scalability are cross-validated on up to 32 nodes of an IBM SP2.
Porting marine ecosystem model spin-up using transport matrices to GPUs

NASA Astrophysics Data System (ADS)

Siewertsen, E.; Piwonski, J.; Slawig, T.

2013-01-01

We have ported an implementation of the spin-up for marine ecosystem models based on transport matrices to graphics processing units (GPUs). The original implementation was designed for distributed-memory architectures and uses the Portable, Extensible Toolkit for Scientific Computation (PETSc) library that is based on the Message Passing Interface (MPI) standard. The spin-up computes a steady seasonal cycle of ecosystem tracers with climatological ocean circulation data as forcing. Since the transport is linear with respect to the tracers, the resulting operator is represented by matrices. Each iteration of the spin-up involves two matrix-vector multiplications and the evaluation of the used biogeochemical model. The original code was written in C and Fortran. On the GPU, we use the Compute Unified Device Architecture (CUDA) standard, a customized version of PETSc and a commercial CUDA Fortran compiler. We describe the extensions to PETSc and the modifications of the original C and Fortran codes that had to be done. Here we make use of freely available libraries for the GPU. We analyze the computational effort of the main parts of the spin-up for two exemplar ecosystem models and compare the overall computational time to those necessary on different CPUs. The results show that a consumer GPU can compete with a significant number of cluster CPUs without further code optimization.
Inference in the brain: Statistics flowing in redundant population codes

PubMed Central

Pitkow, Xaq; Angelaki, Dora E

2017-01-01

It is widely believed that the brain performs approximate probabilistic inference to estimate causal variables in the world from ambiguous sensory data. To understand these computations, we need to analyze how information is represented and transformed by the actions of nonlinear recurrent neural networks. We propose that these probabilistic computations function by a message-passing algorithm operating at the level of redundant neural populations. To explain this framework, we review its underlying concepts, including graphical models, sufficient statistics, and message-passing, and then describe how these concepts could be implemented by recurrently connected probabilistic population codes. The relevant information flow in these networks will be most interpretable at the population level, particularly for redundant neural codes. We therefore outline a general approach to identify the essential features of a neural message-passing algorithm. Finally, we argue that to reveal the most important aspects of these neural computations, we must study large-scale activity patterns during moderately complex, naturalistic behaviors. PMID:28595050
Constrained Multipoint Aerodynamic Shape Optimization Using an Adjoint Formulation and Parallel Computers

NASA Technical Reports Server (NTRS)

Reuther, James; Jameson, Antony; Alonso, Juan Jose; Rimlinger, Mark J.; Saunders, David

1997-01-01

An aerodynamic shape optimization method that treats the design of complex aircraft configurations subject to high fidelity computational fluid dynamics (CFD), geometric constraints and multiple design points is described. The design process will be greatly accelerated through the use of both control theory and distributed memory computer architectures. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods. The resulting problem is implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on a higher order CFD method. In order to facilitate the integration of these high fidelity CFD approaches into future multi-disciplinary optimization (NW) applications, new methods must be developed which are capable of simultaneously addressing complex geometries, multiple objective functions, and geometric design constraints. In our earlier studies, we coupled the adjoint based design formulations with unconstrained optimization algorithms and showed that the approach was effective for the aerodynamic design of airfoils, wings, wing-bodies, and complex aircraft configurations. In many of the results presented in these earlier works, geometric constraints were satisfied either by a projection into feasible space or by posing the design space parameterization such that it automatically satisfied constraints. Furthermore, with the exception of reference 9 where the second author initially explored the use of multipoint design in conjunction with adjoint formulations, our earlier works have focused on single point design efforts. Here we demonstrate that the same methodology may be extended to treat complete configuration designs subject to multiple design points and geometric constraints. Examples are presented for both transonic and supersonic configurations ranging from wing alone designs to complex configuration designs involving wing, fuselage, nacelles and pylons.
Critical field-exponents for secure message-passing in modular networks

NASA Astrophysics Data System (ADS)

Shekhtman, Louis M.; Danziger, Michael M.; Bonamassa, Ivan; Buldyrev, Sergey V.; Caldarelli, Guido; Zlatić, Vinko; Havlin, Shlomo

2018-05-01

We study secure message-passing in the presence of multiple adversaries in modular networks. We assume a dominant fraction of nodes in each module have the same vulnerability, i.e., the same entity spying on them. We find both analytically and via simulations that the links between the modules (interlinks) have effects analogous to a magnetic field in a spin-system in that for any amount of interlinks the system no longer undergoes a phase transition. We then define the exponents δ, which relates the order parameter (the size of the giant secure component) at the critical point to the field strength (average number of interlinks per node), and γ, which describes the susceptibility near criticality. These are found to be δ = 2 and γ = 1 (with the scaling of the order parameter near the critical point given by β = 1). When two or more vulnerabilities are equally present in a module we find δ = 1 and γ = 0 (with β ≥ 2). Apart from defining a previously unidentified universality class, these exponents show that increasing connections between modules is more beneficial for security than increasing connections within modules. We also measure the correlation critical exponent ν, and the upper critical dimension d c , finding that ν {d}c=3 as for ordinary percolation, suggesting that for secure message-passing d c = 6. These results provide an interesting analogy between secure message-passing in modular networks and the physics of magnetic spin-systems.
Roofline model toolkit: A practical tool for architectural and program analysis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lo, Yu Jung; Williams, Samuel; Van Straalen, Brian

We present preliminary results of the Roofline Toolkit for multicore, many core, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented micro benchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measuremore » sustained PCIe throughput with four GPU memory managed mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.« less

MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

NASA Technical Reports Server (NTRS)

Taft, James R.

1999-01-01

Recent developments at the NASA AMES Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.
Access to Attitude-Relevant Information in Memory as a Determinant of Persuasion: The Role of Message and Communicator Attributes.

ERIC Educational Resources Information Center

Wood, Wendy; And Others

Research literature shows that people with access to attitude-relevant information in memory are able to draw on relevant beliefs and prior experiences when analyzing a persuasive message. This suggests that people who can retrieve little attitude-relevant information should be less able to engage in systematic processing. Two experiments were…
Memory for Physical Features of Discourse as a Function of Their Relevance.

ERIC Educational Resources Information Center

Fisher, Ronald P.; Cuervo, Asela

Memory for sex of the speaker and language of presentation of a spoken message was high and reliably better when the features were instrumental for comprehending the message than when they were not. This suggests that the physical characteristics of an event may be deeply or elaborately encoded when they are meaningful in light of the task…
Using PVM to host CLIPS in distributed environments

NASA Technical Reports Server (NTRS)

Myers, Leonard; Pohl, Kym

1994-01-01

It is relatively easy to enhance CLIPS (C Language Integrated Production System) to support multiple expert systems running in a distributed environment with heterogeneous machines. The task is minimized by using the PVM (Parallel Virtual Machine) code from Oak Ridge Labs to provide the distributed utility. PVM is a library of C and FORTRAN subprograms that supports distributive computing on many different UNIX platforms. A PVM deamon is easily installed on each CPU that enters the virtual machine environment. Any user with rsh or rexec access to a machine can use the one PVM deamon to obtain a generous set of distributed facilities. The ready availability of both CLIPS and PVM makes the combination of software particularly attractive for budget conscious experimentation of heterogeneous distributive computing with multiple CLIPS executables. This paper presents a design that is sufficient to provide essential message passing functions in CLIPS and enable the full range of PVM facilities.
Development of a Grid-Independent Geos-Chem Chemical Transport Model (v9-02) as an Atmospheric Chemistry Module for Earth System Models

NASA Technical Reports Server (NTRS)

Long, M. S.; Yantosca, R.; Nielsen, J. E; Keller, C. A.; Da Silva, A.; Sulprizio, M. P.; Pawson, S.; Jacob, D. J.

2015-01-01

The GEOS-Chem global chemical transport model (CTM), used by a large atmospheric chemistry research community, has been re-engineered to also serve as an atmospheric chemistry module for Earth system models (ESMs). This was done using an Earth System Modeling Framework (ESMF) interface that operates independently of the GEOSChem scientific code, permitting the exact same GEOSChem code to be used as an ESM module or as a standalone CTM. In this manner, the continual stream of updates contributed by the CTM user community is automatically passed on to the ESM module, which remains state of science and referenced to the latest version of the standard GEOS-Chem CTM. A major step in this re-engineering was to make GEOS-Chem grid independent, i.e., capable of using any geophysical grid specified at run time. GEOS-Chem data sockets were also created for communication between modules and with external ESM code. The grid-independent, ESMF-compatible GEOS-Chem is now the standard version of the GEOS-Chem CTM. It has been implemented as an atmospheric chemistry module into the NASA GEOS- 5 ESM. The coupled GEOS-5-GEOS-Chem system was tested for scalability and performance with a tropospheric oxidant-aerosol simulation (120 coupled species, 66 transported tracers) using 48-240 cores and message-passing interface (MPI) distributed-memory parallelization. Numerical experiments demonstrate that the GEOS-Chem chemistry module scales efficiently for the number of cores tested, with no degradation as the number of cores increases. Although inclusion of atmospheric chemistry in ESMs is computationally expensive, the excellent scalability of the chemistry module means that the relative cost goes down with increasing number of cores in a massively parallel environment.
The Automatic Parallelisation of Scientific Application Codes Using a Computer Aided Parallelisation Toolkit

NASA Technical Reports Server (NTRS)

Ierotheou, C.; Johnson, S.; Leggett, P.; Cross, M.; Evans, E.; Jin, Hao-Qiang; Frumkin, M.; Yan, J.; Biegel, Bryan (Technical Monitor)

2001-01-01

The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. Historically, the lack of a programming standard for using directives and the rather limited performance due to scalability have affected the take-up of this programming model approach. Significant progress has been made in hardware and software technologies, as a result the performance of parallel programs with compiler directives has also made improvements. The introduction of an industrial standard for shared-memory programming with directives, OpenMP, has also addressed the issue of portability. In this study, we have extended the computer aided parallelization toolkit (developed at the University of Greenwich), to automatically generate OpenMP based parallel programs with nominal user assistance. We outline the way in which loop types are categorized and how efficient OpenMP directives can be defined and placed using the in-depth interprocedural analysis that is carried out by the toolkit. We also discuss the application of the toolkit on the NAS Parallel Benchmarks and a number of real-world application codes. This work not only demonstrates the great potential of using the toolkit to quickly parallelize serial programs but also the good performance achievable on up to 300 processors for hybrid message passing and directive-based parallelizations.
Multiprocessor shared-memory information exchange

DOE Office of Scientific and Technical Information (OSTI.GOV)

Santoline, L.L.; Bowers, M.D.; Crew, A.W.

1989-02-01

In distributed microprocessor-based instrumentation and control systems, the inter-and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically the protocol allows for multiple processors to exchange information via a shared-memory interface. The authors primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, ismore » designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.« less
Requirements analysis for a hardware, discrete-event, simulation engine accelerator

NASA Astrophysics Data System (ADS)

Taylor, Paul J., Jr.

1991-12-01

An analysis of a general Discrete Event Simulation (DES), executing on the distributed architecture of an eight mode Intel PSC/2 hypercube, was performed. The most time consuming portions of the general DES algorithm were determined to be the functions associated with message passing of required simulation data between processing nodes of the hypercube architecture. A behavioral description, using the IEEE standard VHSIC Hardware Description and Design Language (VHDL), for a general DES hardware accelerator is presented. The behavioral description specifies the operational requirements for a DES coprocessor to augment the hypercube's execution of DES simulations. The DES coprocessor design implements the functions necessary to perform distributed discrete event simulations using a conservative time synchronization protocol.
Gene-network inference by message passing

NASA Astrophysics Data System (ADS)

Braunstein, A.; Pagnani, A.; Weigt, M.; Zecchina, R.

2008-01-01

The inference of gene-regulatory processes from gene-expression data belongs to the major challenges of computational systems biology. Here we address the problem from a statistical-physics perspective and develop a message-passing algorithm which is able to infer sparse, directed and combinatorial regulatory mechanisms. Using the replica technique, the algorithmic performance can be characterized analytically for artificially generated data. The algorithm is applied to genome-wide expression data of baker's yeast under various environmental conditions. We find clear cases of combinatorial control, and enrichment in common functional annotations of regulated genes and their regulators.
Optimal mapping of neural-network learning on message-passing multicomputers

NASA Technical Reports Server (NTRS)

Chu, Lon-Chan; Wah, Benjamin W.

1992-01-01

A minimization of learning-algorithm completion time is sought in the present optimal-mapping study of the learning process in multilayer feed-forward artificial neural networks (ANNs) for message-passing multicomputers. A novel approximation algorithm for mappings of this kind is derived from observations of the dominance of a parallel ANN algorithm over its communication time. Attention is given to both static and dynamic mapping schemes for systems with static and dynamic background workloads, as well as to experimental results obtained for simulated mappings on multicomputers with dynamic background workloads.
Trace-Driven Debugging of Message Passing Programs

NASA Technical Reports Server (NTRS)

Frumkin, Michael; Hood, Robert; Lopez, Louis; Bailey, David (Technical Monitor)

1998-01-01

In this paper we report on features added to a parallel debugger to simplify the debugging of parallel message passing programs. These features include replay, setting consistent breakpoints based on interprocess event causality, a parallel undo operation, and communication supervision. These features all use trace information collected during the execution of the program being debugged. We used a number of different instrumentation techniques to collect traces. We also implemented trace displays using two different trace visualization systems. The implementation was tested on an SGI Power Challenge cluster and a network of SGI workstations.
Multiple node remote messaging

DOEpatents

Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Ohmacht, Martin; Salapura, Valentina; Steinmacher-Burow, Burkhard; Vranas, Pavlos

2010-08-31

A method for passing remote messages in a parallel computer system formed as a network of interconnected compute nodes includes that a first compute node (A) sends a single remote message to a remote second compute node (B) in order to control the remote second compute node (B) to send at least one remote message. The method includes various steps including controlling a DMA engine at first compute node (A) to prepare the single remote message to include a first message descriptor and at least one remote message descriptor for controlling the remote second compute node (B) to send at least one remote message, including putting the first message descriptor into an injection FIFO at the first compute node (A) and sending the single remote message and the at least one remote message descriptor to the second compute node (B).
Fourier-based automatic alignment for improved Visual Cryptography schemes.

PubMed

Machizaud, Jacques; Chavel, Pierre; Fournel, Thierry

2011-11-07

In Visual Cryptography, several images, called "shadow images", that separately contain no information, are overlapped to reveal a shared secret message. We develop a method to digitally register one printed shadow image acquired by a camera with a purely digital shadow image, stored in memory. Using Fourier techniques derived from Fourier Optics concepts, the idea is to enhance and exploit the quasi periodicity of the shadow images, composed by a random distribution of black and white patterns on a periodic sampling grid. The advantage is to speed up the security control or the access time to the message, in particular in the cases of a small pixel size or of large numbers of pixels. Furthermore, the interest of visual cryptography can be increased by embedding the initial message in two shadow images that do not have identical mathematical supports, making manual registration impractical. Experimental results demonstrate the successful operation of the method, including the possibility to directly project the result onto the printed shadow image.
Deterministic secure quantum communication using a single d-level system

PubMed Central

Jiang, Dong; Chen, Yuanyuan; Gu, Xuemei; Xie, Ling; Chen, Lijun

2017-01-01

Deterministic secure quantum communication (DSQC) can transmit secret messages between two parties without first generating a shared secret key. Compared with quantum key distribution (QKD), DSQC avoids the waste of qubits arising from basis reconciliation and thus reaches higher efficiency. In this paper, based on data block transmission and order rearrangement technologies, we propose a DSQC protocol. It utilizes a set of single d-level systems as message carriers, which are used to directly encode the secret message in one communication process. Theoretical analysis shows that these employed technologies guarantee the security, and the use of a higher dimensional quantum system makes our protocol achieve higher security and efficiency. Since only quantum memory is required for implementation, our protocol is feasible with current technologies. Furthermore, Trojan horse attack (THA) is taken into account in our protocol. We give a THA model and show that THA significantly increases the multi-photon rate and can thus be detected. PMID:28327557
Low Power LDPC Code Decoder Architecture Based on Intermediate Message Compression Technique

NASA Astrophysics Data System (ADS)

Shimizu, Kazunori; Togawa, Nozomu; Ikenaga, Takeshi; Goto, Satoshi

Reducing the power dissipation for LDPC code decoder is a major challenging task to apply it to the practical digital communication systems. In this paper, we propose a low power LDPC code decoder architecture based on an intermediate message-compression technique which features as follows: (i) An intermediate message compression technique enables the decoder to reduce the required memory capacity and write power dissipation. (ii) A clock gated shift register based intermediate message memory architecture enables the decoder to decompress the compressed messages in a single clock cycle while reducing the read power dissipation. The combination of the above two techniques enables the decoder to reduce the power dissipation while keeping the decoding throughput. The simulation results show that the proposed architecture improves the power efficiency up to 52% and 18% compared to that of the decoder based on the overlapped schedule and the rapid convergence schedule without the proposed techniques respectively.
PhreeqcRM: A reaction module for transport simulators based on the geochemical model PHREEQC

USGS Publications Warehouse

Parkhurst, David L.; Wissmeier, Laurin

2015-01-01

PhreeqcRM is a geochemical reaction module designed specifically to perform equilibrium and kinetic reaction calculations for reactive transport simulators that use an operator-splitting approach. The basic function of the reaction module is to take component concentrations from the model cells of the transport simulator, run geochemical reactions, and return updated component concentrations to the transport simulator. If multicomponent diffusion is modeled (e.g., Nernst–Planck equation), then aqueous species concentrations can be used instead of component concentrations. The reaction capabilities are a complete implementation of the reaction capabilities of PHREEQC. In each cell, the reaction module maintains the composition of all of the reactants, which may include minerals, exchangers, surface complexers, gas phases, solid solutions, and user-defined kinetic reactants.PhreeqcRM assigns initial and boundary conditions for model cells based on standard PHREEQC input definitions (files or strings) of chemical compositions of solutions and reactants. Additional PhreeqcRM capabilities include methods to eliminate reaction calculations for inactive parts of a model domain, transfer concentrations and other model properties, and retrieve selected results. The module demonstrates good scalability for parallel processing by using multiprocessing with MPI (message passing interface) on distributed memory systems, and limited scalability using multithreading with OpenMP on shared memory systems. PhreeqcRM is written in C++, but interfaces allow methods to be called from C or Fortran. By using the PhreeqcRM reaction module, an existing multicomponent transport simulator can be extended to simulate a wide range of geochemical reactions. Results of the implementation of PhreeqcRM as the reaction engine for transport simulators PHAST and FEFLOW are shown by using an analytical solution and the reactive transport benchmark of MoMaS.
Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shan, Hongzhang; Williams, Samuel; Zheng, Yili

2015-10-26

Structured-grid linear solvers often require manually packing and unpacking of communication data to achieve high performance.Orchestrating this process efficiently is challenging, labor-intensive, and potentially error-prone.In this paper, we explore an alternative approach that communicates the data with naturally grained messagesizes without manual packing and unpacking. This approach is the distributed analogue of shared-memory programming, taking advantage of the global addressspace in PGAS languages to provide substantial programming ease. However, its performance may suffer from the large number of small messages. We investigate theruntime support required in the UPC ++ library for this naturally grained version to close the performance gapmore » between the two approaches and attain comparable performance at scale using the High-Performance Geometric Multgrid (HPGMG-FV) benchmark as a driver.« less
HyperForest: A high performance multi-processor architecture for real-time intelligent systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Garcia, P. Jr.; Rebeil, J.P.; Pollard, H.

1997-04-01

Intelligent Systems are characterized by the intensive use of computer power. The computer revolution of the last few years is what has made possible the development of the first generation of Intelligent Systems. Software for second generation Intelligent Systems will be more complex and will require more powerful computing engines in order to meet real-time constraints imposed by new robots, sensors, and applications. A multiprocessor architecture was developed that merges the advantages of message-passing and shared-memory structures: expendability and real-time compliance. The HyperForest architecture will provide an expandable real-time computing platform for computationally intensive Intelligent Systems and open the doorsmore » for the application of these systems to more complex tasks in environmental restoration and cleanup projects, flexible manufacturing systems, and DOE`s own production and disassembly activities.« less
Non-volatile memory for checkpoint storage

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blumrich, Matthias A.; Chen, Dong; Cipolla, Thomas M.

A system, method and computer program product for supporting system initiated checkpoints in high performance parallel computing systems and storing of checkpoint data to a non-volatile memory storage device. The system and method generates selective control signals to perform checkpointing of system related data in presence of messaging activity associated with a user application running at the node. The checkpointing is initiated by the system such that checkpoint data of a plurality of network nodes may be obtained even in the presence of user applications running on highly parallel computers that include ongoing user messaging activity. In one embodiment, themore » non-volatile memory is a pluggable flash memory card.« less
Non-invasive lightweight integration engine for building EHR from autonomous distributed systems.

PubMed

Angulo, Carlos; Crespo, Pere; Maldonado, José A; Moner, David; Pérez, Daniel; Abad, Irene; Mandingorra, Jesús; Robles, Montserrat

2007-12-01

In this paper we describe Pangea-LE, a message-oriented lightweight data integration engine that allows homogeneous and concurrent access to clinical information from disperse and heterogeneous data sources. The engine extracts the information and passes it to the requesting client applications in a flexible XML format. The XML response message can be formatted on demand by appropriate Extensible Stylesheet Language (XSL) transformations in order to meet the needs of client applications. We also present a real deployment in a hospital where Pangea-LE collects and generates an XML view of all the available patient clinical information. The information is presented to healthcare professionals in an Electronic Health Record (EHR) viewer Web application with patient search and EHR browsing capabilities. Implantation in a real setting has been a success due to the non-invasive nature of Pangea-LE which respects the existing information systems.

Three-pass protocol scheme for bitmap image security by using vernam cipher algorithm

NASA Astrophysics Data System (ADS)

Rachmawati, D.; Budiman, M. A.; Aulya, L.

2018-02-01

Confidentiality, integrity, and efficiency are the crucial aspects of data security. Among the other digital data, image data is too prone to abuse of operation like duplication, modification, etc. There are some data security techniques, one of them is cryptography. The security of Vernam Cipher cryptography algorithm is very dependent on the key exchange process. If the key is leaked, security of this algorithm will collapse. Therefore, a method that minimizes key leakage during the exchange of messages is required. The method which is used, is known as Three-Pass Protocol. This protocol enables message delivery process without the key exchange. Therefore, the sending messages process can reach the receiver safely without fear of key leakage. The system is built by using Java programming language. The materials which are used for system testing are image in size 200×200 pixel, 300×300 pixel, 500×500 pixel, 800×800 pixel and 1000×1000 pixel. The result of experiments showed that Vernam Cipher algorithm in Three-Pass Protocol scheme could restore the original image.
Ordering Traces Logically to Identify Lateness in Message Passing Programs

DOE PAGES

Isaacs, Katherine E.; Gamblin, Todd; Bhatele, Abhinav; ...

2015-03-30

Event traces are valuable for understanding the behavior of parallel programs. However, automatically analyzing a large parallel trace is difficult, especially without a specific objective. We aid this endeavor by extracting a trace's logical structure, an ordering of trace events derived from happened-before relationships, while taking into account developer intent. Using this structure, we can calculate an operation's delay relative to its peers on other processes. The logical structure also serves as a platform for comparing and clustering processes as well as highlighting communication patterns in a trace visualization. We present an algorithm for determining this idealized logical structure frommore » traces of message passing programs, and we develop metrics to quantify delays and differences among processes. We implement our techniques in Ravel, a parallel trace visualization tool that displays both logical and physical timelines. Rather than showing the duration of each operation, we display where delays begin and end, and how they propagate. As a result, we apply our approach to the traces of several message passing applications, demonstrating the accuracy of our extracted structure and its utility in analyzing these codes.« less
Impact of the infectious period on epidemics

NASA Astrophysics Data System (ADS)

Wilkinson, Robert R.; Sharkey, Kieran J.

2018-05-01

The duration of the infectious period is a crucial determinant of the ability of an infectious disease to spread. We consider an epidemic model that is network based and non-Markovian, containing classic Kermack-McKendrick, pairwise, message passing, and spatial models as special cases. For this model, we prove a monotonic relationship between the variability of the infectious period (with fixed mean) and the probability that the infection will reach any given subset of the population by any given time. For certain families of distributions, this result implies that epidemic severity is decreasing with respect to the variance of the infectious period. The striking importance of this relationship is demonstrated numerically. We then prove, with a fixed basic reproductive ratio (R0), a monotonic relationship between the variability of the posterior transmission probability (which is a function of the infectious period) and the probability that the infection will reach any given subset of the population by any given time. Thus again, even when R0 is fixed, variability of the infectious period tends to dampen the epidemic. Numerical results illustrate this but indicate the relationship is weaker. We then show how our results apply to message passing, pairwise, and Kermack-McKendrick epidemic models, even when they are not exactly consistent with the stochastic dynamics. For Poissonian contact processes, and arbitrarily distributed infectious periods, we demonstrate how systems of delay differential equations and ordinary differential equations can provide upper and lower bounds, respectively, for the probability that any given individual has been infected by any given time.
Distributed computing system with dual independent communications paths between computers and employing split tokens

NASA Technical Reports Server (NTRS)

Rasmussen, Robert D. (Inventor); Manning, Robert M. (Inventor); Lewis, Blair F. (Inventor); Bolotin, Gary S. (Inventor); Ward, Richard S. (Inventor)

1990-01-01

This is a distributed computing system providing flexible fault tolerance; ease of software design and concurrency specification; and dynamic balance of the loads. The system comprises a plurality of computers each having a first input/output interface and a second input/output interface for interfacing to communications networks each second input/output interface including a bypass for bypassing the associated computer. A global communications network interconnects the first input/output interfaces for providing each computer the ability to broadcast messages simultaneously to the remainder of the computers. A meshwork communications network interconnects the second input/output interfaces providing each computer with the ability to establish a communications link with another of the computers bypassing the remainder of computers. Each computer is controlled by a resident copy of a common operating system. Communications between respective ones of computers is by means of split tokens each having a moving first portion which is sent from computer to computer and a resident second portion which is disposed in the memory of at least one of computer and wherein the location of the second portion is part of the first portion. The split tokens represent both functions to be executed by the computers and data to be employed in the execution of the functions. The first input/output interfaces each include logic for detecting a collision between messages and for terminating the broadcasting of a message whereby collisions between messages are detected and avoided.
TravelAid : in-vehicle signing and variable speed limit evaluation

DOT National Transportation Integrated Search

2001-12-01

This report discusses the effectiveness of using variable message signs (VMSs) and in-vehicle traffic advisory systems (IVUs) on a mountainous pass (Snoqualmie Pass on Interstate 90 in Washington State) for changing driver behavior. As part of this p...
Holographic disk with high data transfer rate: its application to an audio response memory.

PubMed

Kubota, K; Ono, Y; Kondo, M; Sugama, S; Nishida, N; Sakaguchi, M

1980-03-15

This paper describes a memory realized with a high data transfer rate using the holographic parallel-processing function and its application to an audio response system that supplies many audio messages to many terminals simultaneously. Digitalized audio messages are recorded as tiny 1-D Fourier transform holograms on a holographic disk. A hologram recorder and a hologram reader were constructed to test and demonstrate the holographic audio response memory feasibility. Experimental results indicate the potentiality of an audio response system with a 2000-word vocabulary and 250-Mbit/sec bit transfer rate.
Implementing TCP/IP and a socket interface as a server in a message-passing operating system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hipp, E.; Wiltzius, D.

1990-03-01

The UNICOS 4.3BSD network code and socket transport interface are the basis of an explicit network server for NLTSS, a message passing operating system on the Cray YMP. A BSD socket user library provides access to the network server using an RPC mechanism. The advantages of this server methodology are its modularity and extensibility to migrate to future protocol suites (e.g. OSI) and transport interfaces. In addition, the network server is implemented in an explicit multi-tasking environment to take advantage of the Cray YMP multi-processor platform. 19 refs., 5 figs.
Complexity of parallel implementation of domain decomposition techniques for elliptic partial differential equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gropp, W.D.; Keyes, D.E.

1988-03-01

The authors discuss the parallel implementation of preconditioned conjugate gradient (PCG)-based domain decomposition techniques for self-adjoint elliptic partial differential equations in two dimensions on several architectures. The complexity of these methods is described on a variety of message-passing parallel computers as a function of the size of the problem, number of processors and relative communication speeds of the processors. They show that communication startups are very important, and that even the small amount of global communication in these methods can significantly reduce the performance of many message-passing architectures.
n-body simulations using message passing parallel computers.

NASA Astrophysics Data System (ADS)

Grama, A. Y.; Kumar, V.; Sameh, A.

The authors present new parallel formulations of the Barnes-Hut method for n-body simulations on message passing computers. These parallel formulations partition the domain efficiently incurring minimal communication overhead. This is in contrast to existing schemes that are based on sorting a large number of keys or on the use of global data structures. The new formulations are augmented by alternate communication strategies which serve to minimize communication overhead. The impact of these communication strategies is experimentally studied. The authors report on experimental results obtained from an astrophysical simulation on an nCUBE2 parallel computer.
Implementation of a Message Passing Interface into a Cloud-Resolving Model for Massively Parallel Computing

NASA Technical Reports Server (NTRS)

Juang, Hann-Ming Henry; Tao, Wei-Kuo; Zeng, Xi-Ping; Shie, Chung-Lin; Simpson, Joanne; Lang, Steve

2004-01-01

The capability for massively parallel programming (MPP) using a message passing interface (MPI) has been implemented into a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The design for the MPP with MPI uses the concept of maintaining similar code structure between the whole domain as well as the portions after decomposition. Hence the model follows the same integration for single and multiple tasks (CPUs). Also, it provides for minimal changes to the original code, so it is easily modified and/or managed by the model developers and users who have little knowledge of MPP. The entire model domain could be sliced into one- or two-dimensional decomposition with a halo regime, which is overlaid on partial domains. The halo regime requires that no data be fetched across tasks during the computational stage, but it must be updated before the next computational stage through data exchange via MPI. For reproducible purposes, transposing data among tasks is required for spectral transform (Fast Fourier Transform, FFT), which is used in the anelastic version of the model for solving the pressure equation. The performance of the MPI-implemented codes (i.e., the compressible and anelastic versions) was tested on three different computing platforms. The major results are: 1) both versions have speedups of about 99% up to 256 tasks but not for 512 tasks; 2) the anelastic version has better speedup and efficiency because it requires more computations than that of the compressible version; 3) equal or approximately-equal numbers of slices between the x- and y- directions provide the fastest integration due to fewer data exchanges; and 4) one-dimensional slices in the x-direction result in the slowest integration due to the need for more memory relocation for computation.
Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines.

PubMed

Zhang, Xiaohua; Wong, Sergio E; Lightstone, Felice C

2013-04-30

A mixed parallel scheme that combines message passing interface (MPI) and multithreading was implemented in the AutoDock Vina molecular docking program. The resulting program, named VinaLC, was tested on the petascale high performance computing (HPC) machines at Lawrence Livermore National Laboratory. To exploit the typical cluster-type supercomputers, thousands of docking calculations were dispatched by the master process to run simultaneously on thousands of slave processes, where each docking calculation takes one slave process on one node, and within the node each docking calculation runs via multithreading on multiple CPU cores and shared memory. Input and output of the program and the data handling within the program were carefully designed to deal with large databases and ultimately achieve HPC on a large number of CPU cores. Parallel performance analysis of the VinaLC program shows that the code scales up to more than 15K CPUs with a very low overhead cost of 3.94%. One million flexible compound docking calculations took only 1.4 h to finish on about 15K CPUs. The docking accuracy of VinaLC has been validated against the DUD data set by the re-docking of X-ray ligands and an enrichment study, 64.4% of the top scoring poses have RMSD values under 2.0 Å. The program has been demonstrated to have good enrichment performance on 70% of the targets in the DUD data set. An analysis of the enrichment factors calculated at various percentages of the screening database indicates VinaLC has very good early recovery of actives. Copyright © 2013 Wiley Periodicals, Inc.
PARALLEL HOP: A SCALABLE HALO FINDER FOR MASSIVE COSMOLOGICAL DATA SETS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Skory, Stephen; Turk, Matthew J.; Norman, Michael L.

2010-11-15

Modern N-body cosmological simulations contain billions (10{sup 9}) of dark matter particles. These simulations require hundreds to thousands of gigabytes of memory and employ hundreds to tens of thousands of processing cores on many compute nodes. In order to study the distribution of dark matter in a cosmological simulation, the dark matter halos must be identified using a halo finder, which establishes the halo membership of every particle in the simulation. The resources required for halo finding are similar to the requirements for the simulation itself. In particular, simulations have become too extensive to use commonly employed halo finders, suchmore » that the computational requirements to identify halos must now be spread across multiple nodes and cores. Here, we present a scalable-parallel halo finding method called Parallel HOP for large-scale cosmological simulation data. Based on the halo finder HOP, it utilizes message passing interface and domain decomposition to distribute the halo finding workload across multiple compute nodes, enabling analysis of much larger data sets than is possible with the strictly serial or previous parallel implementations of HOP. We provide a reference implementation of this method as a part of the toolkit {sup yt}, an analysis toolkit for adaptive mesh refinement data that include complementary analysis modules. Additionally, we discuss a suite of benchmarks that demonstrate that this method scales well up to several hundred tasks and data sets in excess of 2000{sup 3} particles. The Parallel HOP method and our implementation can be readily applied to any kind of N-body simulation data and is therefore widely applicable.« less
Reliable broadcast protocols

NASA Technical Reports Server (NTRS)

Joseph, T. A.; Birman, Kenneth P.

1989-01-01

A number of broadcast protocols that are reliable subject to a variety of ordering and delivery guarantees are considered. Developing applications that are distributed over a number of sites and/or must tolerate the failures of some of them becomes a considerably simpler task when such protocols are available for communication. Without such protocols the kinds of distributed applications that can reasonably be built will have a very limited scope. As the trend towards distribution and decentralization continues, it will not be surprising if reliable broadcast protocols have the same role in distributed operating systems of the future that message passing mechanisms have in the operating systems of today. On the other hand, the problems of engineering such a system remain large. For example, deciding which protocol is the most appropriate to use in a certain situation or how to balance the latency-communication-storage costs is not an easy question.
Parallel Simulation of Unsteady Turbulent Flames

NASA Technical Reports Server (NTRS)

Menon, Suresh

1996-01-01

Time-accurate simulation of turbulent flames in high Reynolds number flows is a challenging task since both fluid dynamics and combustion must be modeled accurately. To numerically simulate this phenomenon, very large computer resources (both time and memory) are required. Although current vector supercomputers are capable of providing adequate resources for simulations of this nature, the high cost and their limited availability, makes practical use of such machines less than satisfactory. At the same time, the explicit time integration algorithms used in unsteady flow simulations often possess a very high degree of parallelism, making them very amenable to efficient implementation on large-scale parallel computers. Under these circumstances, distributed memory parallel computers offer an excellent near-term solution for greatly increased computational speed and memory, at a cost that may render the unsteady simulations of the type discussed above more feasible and affordable.This paper discusses the study of unsteady turbulent flames using a simulation algorithm that is capable of retaining high parallel efficiency on distributed memory parallel architectures. Numerical studies are carried out using large-eddy simulation (LES). In LES, the scales larger than the grid are computed using a time- and space-accurate scheme, while the unresolved small scales are modeled using eddy viscosity based subgrid models. This is acceptable for the moment/energy closure since the small scales primarily provide a dissipative mechanism for the energy transferred from the large scales. However, for combustion to occur, the species must first undergo mixing at the small scales and then come into molecular contact. Therefore, global models cannot be used. Recently, a new model for turbulent combustion was developed, in which the combustion is modeled, within the subgrid (small-scales) using a methodology that simulates the mixing and the molecular transport and the chemical kinetics within each LES grid cell. Finite-rate kinetics can be included without any closure and this approach actually provides a means to predict the turbulent rates and the turbulent flame speed. The subgrid combustion model requires resolution of the local time scales associated with small-scale mixing, molecular diffusion and chemical kinetics and, therefore, within each grid cell, a significant amount of computations must be carried out before the large-scale (LES resolved) effects are incorporated. Therefore, this approach is uniquely suited for parallel processing and has been implemented on various systems such as: Intel Paragon, IBM SP-2, Cray T3D and SGI Power Challenge (PC) using the system independent Message Passing Interface (MPI) compiler. In this paper, timing data on these machines is reported along with some characteristic results.
Flood predictions using the parallel version of distributed numerical physical rainfall-runoff model TOPKAPI

NASA Astrophysics Data System (ADS)

Boyko, Oleksiy; Zheleznyak, Mark

2015-04-01

The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI ( Todini et al, 1996-2014) is developed and implemented in Ukraine. The parallel version of the code has been developed recently to be used on multiprocessors systems - multicore/processors PC and clusters. Algorithm is based on binary-tree decomposition of the watershed for the balancing of the amount of computation for all processors/cores. Message passing interface (MPI) protocol is used as a parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated for the case studies for the flood predictions of the mountain watersheds of the Ukrainian Carpathian regions. The modeling results is compared with the predictions based on the lumped parameters models.
Audience-tuning effects on memory: the role of shared reality.

PubMed

Echterhoff, Gerald; Higgins, E Tory; Groll, Stephan

2005-09-01

After tuning to an audience, communicators' own memories for the topic often reflect the biased view expressed in their messages. Three studies examined explanations for this bias. Memories for a target person were biased when feedback signaled the audience's successful identification of the target but not after failed identification (Experiment 1). Whereas communicators tuning to an in-group audience exhibited the bias, communicators tuning to an out-group audience did not (Experiment 2). These differences did not depend on communicators' mood but were mediated by communicators' trust in their audience's judgment about other people (Experiments 2 and 3). Message and memory were more closely associated for high than for low trusters. Apparently, audience-tuning effects depend on the communicators' experience of a shared reality.
OFF, Open source Finite volume Fluid dynamics code: A free, high-order solver based on parallel, modular, object-oriented Fortran API

NASA Astrophysics Data System (ADS)

Zaghi, S.

2014-07-01

OFF, an open source (free software) code for performing fluid dynamics simulations, is presented. The aim of OFF is to solve, numerically, the unsteady (and steady) compressible Navier-Stokes equations of fluid dynamics by means of finite volume techniques: the research background is mainly focused on high-order (WENO) schemes for multi-fluids, multi-phase flows over complex geometries. To this purpose a highly modular, object-oriented application program interface (API) has been developed. In particular, the concepts of data encapsulation and inheritance available within Fortran language (from standard 2003) have been stressed in order to represent each fluid dynamics "entity" (e.g. the conservative variables of a finite volume, its geometry, etc…) by a single object so that a large variety of computational libraries can be easily (and efficiently) developed upon these objects. The main features of OFF can be summarized as follows: Programming LanguageOFF is written in standard (compliant) Fortran 2003; its design is highly modular in order to enhance simplicity of use and maintenance without compromising the efficiency; Parallel Frameworks Supported the development of OFF has been also targeted to maximize the computational efficiency: the code is designed to run on shared-memory multi-cores workstations and distributed-memory clusters of shared-memory nodes (supercomputers); the code's parallelization is based on Open Multiprocessing (OpenMP) and Message Passing Interface (MPI) paradigms; Usability, Maintenance and Enhancement in order to improve the usability, maintenance and enhancement of the code also the documentation has been carefully taken into account; the documentation is built upon comprehensive comments placed directly into the source files (no external documentation files needed): these comments are parsed by means of doxygen free software producing high quality html and latex documentation pages; the distributed versioning system referred as git has been adopted in order to facilitate the collaborative maintenance and improvement of the code; CopyrightsOFF is a free software that anyone can use, copy, distribute, study, change and improve under the GNU Public License version 3. The present paper is a manifesto of OFF code and presents the currently implemented features and ongoing developments. This work is focused on the computational techniques adopted and a detailed description of the main API characteristics is reported. OFF capabilities are demonstrated by means of one and two dimensional examples and a three dimensional real application.
Parallel Numerical Simulations of Water Reservoirs

NASA Astrophysics Data System (ADS)

Torres, Pedro; Mangiavacchi, Norberto

2010-11-01

The study of the water flow and scalar transport in water reservoirs is important for the determination of the water quality during the initial stages of the reservoir filling and during the life of the reservoir. For this scope, a parallel 2D finite element code for solving the incompressible Navier-Stokes equations coupled with scalar transport was implemented using the message-passing programming model, in order to perform simulations of hidropower water reservoirs in a computer cluster environment. The spatial discretization is based on the MINI element that satisfies the Babuska-Brezzi (BB) condition, which provides sufficient conditions for a stable mixed formulation. All the distributed data structures needed in the different stages of the code, such as preprocessing, solving and post processing, were implemented using the PETSc library. The resulting linear systems for the velocity and the pressure fields were solved using the projection method, implemented by an approximate block LU factorization. In order to increase the parallel performance in the solution of the linear systems, we employ the static condensation method for solving the intermediate velocity at vertex and centroid nodes separately. We compare performance results of the static condensation method with the approach of solving the complete system. In our tests the static condensation method shows better performance for large problems, at the cost of an increased memory usage. Performance results for other intensive parts of the code in a computer cluster are also presented.
On the use of flux limiters in the discrete ordinates method for 3D radiation calculations in absorbing and scattering media

NASA Astrophysics Data System (ADS)

Godoy, William F.; DesJardin, Paul E.

2010-05-01

The application of flux limiters to the discrete ordinates method (DOM), SN, for radiative transfer calculations is discussed and analyzed for 3D enclosures for cases in which the intensities are strongly coupled to each other such as: radiative equilibrium and scattering media. A Newton-Krylov iterative method (GMRES) solves the final systems of linear equations along with a domain decomposition strategy for parallel computation using message passing libraries in a distributed memory system. Ray effects due to angular discretization and errors due to domain decomposition are minimized until small variations are introduced by these effects in order to focus on the influence of flux limiters on errors due to spatial discretization, known as numerical diffusion, smearing or false scattering. Results are presented for the DOM-integrated quantities such as heat flux, irradiation and emission. A variety of flux limiters are compared to "exact" solutions available in the literature, such as the integral solution of the RTE for pure absorbing-emitting media and isotropic scattering cases and a Monte Carlo solution for a forward scattering case. Additionally, a non-homogeneous 3D enclosure is included to extend the use of flux limiters to more practical cases. The overall balance of convergence, accuracy, speed and stability using flux limiters is shown to be superior compared to step schemes for any test case.
Rated Measures of Narrative Structure for Written Smoking-Cessation Texts

PubMed Central

Sanders-Jackson, Ashley

2014-01-01

This article describes the effect of a series of rated measures of narrative structure on recognition memory, agreement on story-relevant beliefs, and intention to engage in a health-related behavior—in this case smoking cessation. Using short smoking-cessation stories as stimuli, data were collected in a nationally representative sample of adult smokers (n = 1,312). Results suggested that messages rated as more sequential improved encoding and messages rated as containing more context decreased encoding. Messages rated high in transportation were associated with increased recognition, agreement with story-relevant beliefs, and intention to quit. Both positive and negative emotion were positively associated with intention to quit, but were negatively associated with recognition memory. PMID:24447036

Relatively effortless listening promotes understanding and recall of medical instructions in older adults

PubMed Central

DiDonato, Roberta M.; Surprenant, Aimée M.

2015-01-01

Communication success under adverse conditions requires efficient and effective recruitment of both bottom-up (sensori-perceptual) and top-down (cognitive-linguistic) resources to decode the intended auditory-verbal message. Employing these limited capacity resources has been shown to vary across the lifespan, with evidence indicating that younger adults out-perform older adults for both comprehension and memory of the message. This study examined how sources of interference arising from the speaker (message spoken with conversational vs. clear speech technique), the listener (hearing-listening and cognitive-linguistic factors), and the environment (in competing speech babble noise vs. quiet) interact and influence learning and memory performance using more ecologically valid methods than has been done previously. The results suggest that when older adults listened to complex medical prescription instructions with “clear speech,” (presented at audible levels through insertion earphones) their learning efficiency, immediate, and delayed memory performance improved relative to their performance when they listened with a normal conversational speech rate (presented at audible levels in sound field). This better learning and memory performance for clear speech listening was maintained even in the presence of speech babble noise. The finding that there was the largest learning-practice effect on 2nd trial performance in the conversational speech when the clear speech listening condition was first is suggestive of greater experience-dependent perceptual learning or adaptation to the speaker's speech and voice pattern in clear speech. This suggests that experience-dependent perceptual learning plays a role in facilitating the language processing and comprehension of a message and subsequent memory encoding. PMID:26106353
A fast parallel 3D Poisson solver with longitudinal periodic and transverse open boundary conditions for space-charge simulations

NASA Astrophysics Data System (ADS)

Qiang, Ji

2017-10-01

A three-dimensional (3D) Poisson solver with longitudinal periodic and transverse open boundary conditions can have important applications in beam physics of particle accelerators. In this paper, we present a fast efficient method to solve the Poisson equation using a spectral finite-difference method. This method uses a computational domain that contains the charged particle beam only and has a computational complexity of O(Nu(logNmode)) , where Nu is the total number of unknowns and Nmode is the maximum number of longitudinal or azimuthal modes. This saves both the computational time and the memory usage of using an artificial boundary condition in a large extended computational domain. The new 3D Poisson solver is parallelized using a message passing interface (MPI) on multi-processor computers and shows a reasonable parallel performance up to hundreds of processor cores.
Tolerant (parallel) Programming

NASA Technical Reports Server (NTRS)

DiNucci, David C.; Bailey, David H. (Technical Monitor)

1997-01-01

In order to be truly portable, a program must be tolerant of a wide range of development and execution environments, and a parallel program is just one which must be tolerant of a very wide range. This paper first defines the term "tolerant programming", then describes many layers of tools to accomplish it. The primary focus is on F-Nets, a formal model for expressing computation as a folded partial-ordering of operations, thereby providing an architecture-independent expression of tolerant parallel algorithms. For implementing F-Nets, Cooperative Data Sharing (CDS) is a subroutine package for implementing communication efficiently in a large number of environments (e.g. shared memory and message passing). Software Cabling (SC), a very-high-level graphical programming language for building large F-Nets, possesses many of the features normally expected from today's computer languages (e.g. data abstraction, array operations). Finally, L2(sup 3) is a CASE tool which facilitates the construction, compilation, execution, and debugging of SC programs.
A 3D staggered-grid finite difference scheme for poroelastic wave equation

NASA Astrophysics Data System (ADS)

Zhang, Yijie; Gao, Jinghuai

2014-10-01

Three dimensional numerical modeling has been a viable tool for understanding wave propagation in real media. The poroelastic media can better describe the phenomena of hydrocarbon reservoirs than acoustic and elastic media. However, the numerical modeling in 3D poroelastic media demands significantly more computational capacity, including both computational time and memory. In this paper, we present a 3D poroelastic staggered-grid finite difference (SFD) scheme. During the procedure, parallel computing is implemented to reduce the computational time. Parallelization is based on domain decomposition, and communication between processors is performed using message passing interface (MPI). Parallel analysis shows that the parallelized SFD scheme significantly improves the simulation efficiency and 3D decomposition in domain is the most efficient. We also analyze the numerical dispersion and stability condition of the 3D poroelastic SFD method. Numerical results show that the 3D numerical simulation can provide a real description of wave propagation.
Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Wang, Yi-Min

1993-01-01

Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called recovery line. If the processes are allowed to take uncoordinated checkpoints, the above rollback propagation may result in the domino effect which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated.
Relations Among Central Auditory Abilities, Socio-Economic Factors, Speech Delay, Phonic Abilities and Reading Achievement: A Longitudinal Study.

ERIC Educational Resources Information Center

Flowers, Arthur; Crandell, Edwin W.

Three auditory perceptual processes (resistance to distortion, selective listening in the form of auditory dedifferentiation, and binaural synthesis) were evaluated by five assessment techniques: (1) low pass filtered speech, (2) accelerated speech, (3) competing messages, (4) accelerated plus competing messages, and (5) binaural synthesis.…
Quantum cluster variational method and message passing algorithms revisited

NASA Astrophysics Data System (ADS)

Domínguez, E.; Mulet, Roberto

2018-02-01

We present a general framework to study quantum disordered systems in the context of the Kikuchi's cluster variational method (CVM). The method relies in the solution of message passing-like equations for single instances or in the iterative solution of complex population dynamic algorithms for an average case scenario. We first show how a standard application of the Kikuchi's CVM can be easily translated to message passing equations for specific instances of the disordered system. We then present an "ad hoc" extension of these equations to a population dynamic algorithm representing an average case scenario. At the Bethe level, these equations are equivalent to the dynamic population equations that can be derived from a proper cavity ansatz. However, at the plaquette approximation, the interpretation is more subtle and we discuss it taking also into account previous results in classical disordered models. Moreover, we develop a formalism to properly deal with the average case scenario using a replica-symmetric ansatz within this CVM for quantum disordered systems. Finally, we present and discuss numerical solutions of the different approximations for the quantum transverse Ising model and the quantum random field Ising model in two-dimensional lattices.
Applying the Concept of Working Memory to Foreign Language Listening Comprehension.

ERIC Educational Resources Information Center

Service, Elisabet

Memory research has recently moved from looking at performance in highly artificial laboratory tasks to examination of tasks in everyday life. One consequence is the development of the concept of "working memory." For the learner, foreign language comprehension makes great demands on working memory capacity. Comprehension of a message requires…
Flexible language constructs for large parallel programs

NASA Technical Reports Server (NTRS)

Rosing, Matthew; Schnabel, Robert

1993-01-01

The goal of the research described is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (MIMD) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. An overview of a new language that combines many of these programming models in a clean manner is given. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. An overview of the language and discussion of some of the critical implementation details is given.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

NASA Astrophysics Data System (ADS)

Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

2015-06-01

The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications.
SU-F-BRD-09: A Random Walk Model Algorithm for Proton Dose Calculation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yao, W; Farr, J

2015-06-15

Purpose: To develop a random walk model algorithm for calculating proton dose with balanced computation burden and accuracy. Methods: Random walk (RW) model is sometimes referred to as a density Monte Carlo (MC) simulation. In MC proton dose calculation, the use of Gaussian angular distribution of protons due to multiple Coulomb scatter (MCS) is convenient, but in RW the use of Gaussian angular distribution requires an extremely large computation and memory. Thus, our RW model adopts spatial distribution from the angular one to accelerate the computation and to decrease the memory usage. From the physics and comparison with the MCmore » simulations, we have determined and analytically expressed those critical variables affecting the dose accuracy in our RW model. Results: Besides those variables such as MCS, stopping power, energy spectrum after energy absorption etc., which have been extensively discussed in literature, the following variables were found to be critical in our RW model: (1) inverse squared law that can significantly reduce the computation burden and memory, (2) non-Gaussian spatial distribution after MCS, and (3) the mean direction of scatters at each voxel. In comparison to MC results, taken as reference, for a water phantom irradiated by mono-energetic proton beams from 75 MeV to 221.28 MeV, the gamma test pass rate was 100% for the 2%/2mm/10% criterion. For a highly heterogeneous phantom consisting of water embedded by a 10 cm cortical bone and a 10 cm lung in the Bragg peak region of the proton beam, the gamma test pass rate was greater than 98% for the 3%/3mm/10% criterion. Conclusion: We have determined key variables in our RW model for proton dose calculation. Compared with commercial pencil beam algorithms, our RW model much improves the dose accuracy in heterogeneous regions, and is about 10 times faster than MC simulations.« less
Low latency, high bandwidth data communications between compute nodes in a parallel computer

DOEpatents

Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

2010-11-02

Methods, parallel computers, and computer program products are disclosed for low latency, high bandwidth data communications between compute nodes in a parallel computer. Embodiments include receiving, by an origin direct memory access (`DMA`) engine of an origin compute node, data for transfer to a target compute node; sending, by the origin DMA engine of the origin compute node to a target DMA engine on the target compute node, a request to send (`RTS`) message; transferring, by the origin DMA engine, a predetermined portion of the data to the target compute node using memory FIFO operation; determining, by the origin DMA engine whether an acknowledgement of the RTS message has been received from the target DMA engine; if the an acknowledgement of the RTS message has not been received, transferring, by the origin DMA engine, another predetermined portion of the data to the target compute node using a memory FIFO operation; and if the acknowledgement of the RTS message has been received by the origin DMA engine, transferring, by the origin DMA engine, any remaining portion of the data to the target compute node using a direct put operation.
Direct memory access transfer completion notification

DOEpatents

Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Parker, Jeffrey J [Rochester, MN

2011-02-15

DMA transfer completion notification includes: inserting, by an origin DMA engine on an origin node in an injection first-in-first-out (`FIFO`) buffer, a data descriptor for an application message to be transferred to a target node on behalf of an application on the origin node; inserting, by the origin DMA engine, a completion notification descriptor in the injection FIFO buffer after the data descriptor for the message, the completion notification descriptor specifying a packet header for a completion notification packet; transferring, by the origin DMA engine to the target node, the message in dependence upon the data descriptor; sending, by the origin DMA engine, the completion notification packet to a local reception FIFO buffer using a local memory FIFO transfer operation; and notifying, by the origin DMA engine, the application that transfer of the message is complete in response to receiving the completion notification packet in the local reception FIFO buffer.
Learning classifier systems for single and multiple mobile robots in unstructured environments

NASA Astrophysics Data System (ADS)

Bay, John S.

1995-12-01

The learning classifier system (LCS) is a learning production system that generates behavioral rules via an underlying discovery mechanism. The LCS architecture operates similarly to a blackboard architecture; i.e., by posted-message communications. But in the LCS, the message board is wiped clean at every time interval, thereby requiring no persistent shared resource. In this paper, we adapt the LCS to the problem of mobile robot navigation in completely unstructured environments. We consider the model of the robot itself, including its sensor and actuator structures, to be part of this environment, in addition to the world-model that includes a goal and obstacles at unknown locations. This requires a robot to learn its own I/O characteristics in addition to solving its navigation problem, but results in a learning controller that is equally applicable, unaltered, in robots with a wide variety of kinematic structures and sensing capabilities. We show the effectiveness of this LCS-based controller through both simulation and experimental trials with a small robot. We then propose a new architecture, the Distributed Learning Classifier System (DLCS), which generalizes the message-passing behavior of the LCS from internal messages within a single agent to broadcast massages among multiple agents. This communications mode requires little bandwidth and is easily implemented with inexpensive, off-the-shelf hardware. The DLCS is shown to have potential application as a learning controller for multiple intelligent agents.
ADHydro: A Parallel Implementation of a Large-scale High-Resolution Multi-Physics Distributed Water Resources Model Using the Charm++ Run Time System

NASA Astrophysics Data System (ADS)

Steinke, R. C.; Ogden, F. L.; Lai, W.; Moreno, H. A.; Pureza, L. G.

2014-12-01

Physics-based watershed models are useful tools for hydrologic studies, water resources management and economic analyses in the contexts of climate, land-use, and water-use changes. This poster presents a parallel implementation of a quasi 3-dimensional, physics-based, high-resolution, distributed water resources model suitable for simulating large watersheds in a massively parallel computing environment. Developing this model is one of the objectives of the NSF EPSCoR RII Track II CI-WATER project, which is joint between Wyoming and Utah EPSCoR jurisdictions. The model, which we call ADHydro, is aimed at simulating important processes in the Rocky Mountain west, including: rainfall and infiltration, snowfall and snowmelt in complex terrain, vegetation and evapotranspiration, soil heat flux and freezing, overland flow, channel flow, groundwater flow, water management and irrigation. Model forcing is provided by the Weather Research and Forecasting (WRF) model, and ADHydro is coupled with the NOAH-MP land-surface scheme for calculating fluxes between the land and atmosphere. The ADHydro implementation uses the Charm++ parallel run time system. Charm++ is based on location transparent message passing between migrateable C++ objects. Each object represents an entity in the model such as a mesh element. These objects can be migrated between processors or serialized to disk allowing the Charm++ system to automatically provide capabilities such as load balancing and checkpointing. Objects interact with each other by passing messages that the Charm++ system routes to the correct destination object regardless of its current location. This poster discusses the algorithms, communication patterns, and caching strategies used to implement ADHydro with Charm++. The ADHydro model code will be released to the hydrologic community in late 2014.
Weighted community detection and data clustering using message passing

NASA Astrophysics Data System (ADS)

Shi, Cheng; Liu, Yanchen; Zhang, Pan

2018-03-01

Grouping objects into clusters based on the similarities or weights between them is one of the most important problems in science and engineering. In this work, by extending message-passing algorithms and spectral algorithms proposed for an unweighted community detection problem, we develop a non-parametric method based on statistical physics, by mapping the problem to the Potts model at the critical temperature of spin-glass transition and applying belief propagation to solve the marginals corresponding to the Boltzmann distribution. Our algorithm is robust to over-fitting and gives a principled way to determine whether there are significant clusters in the data and how many clusters there are. We apply our method to different clustering tasks. In the community detection problem in weighted and directed networks, we show that our algorithm significantly outperforms existing algorithms. In the clustering problem, where the data were generated by mixture models in the sparse regime, we show that our method works all the way down to the theoretical limit of detectability and gives accuracy very close to that of the optimal Bayesian inference. In the semi-supervised clustering problem, our method only needs several labels to work perfectly in classic datasets. Finally, we further develop Thouless-Anderson-Palmer equations which heavily reduce the computation complexity in dense networks but give almost the same performance as belief propagation.
A concurrent distributed system for aircraft tactical decision generation

NASA Technical Reports Server (NTRS)

Mcmanus, John W.

1990-01-01

A research program investigating the use of AI techniques to aid in the development of a tactical decision generator (TDG) for within visual range (WVR) air combat engagements is discussed. The application of AI programming and problem-solving methods in the development and implementation of a concurrent version of the computerized logic for air-to-air warfare simulations (CLAWS) program, a second-generation TDG, is presented. Concurrent computing environments and programming approaches are discussed, and the design and performance of prototype concurrent TDG system (Cube CLAWS) are presented. It is concluded that the Cube CLAWS has provided a useful testbed to evaluate the development of a distributed blackboard system. The project has shown that the complexity of developing specialized software on a distributed, message-passing architecture such as the Hypercube is not overwhelming, and that reasonable speedups and processor efficiency can be achieved by a distributed blackboard system. The project has also highlighted some of the costs of using a distributed approach to designing a blackboard system.
Re-Presenting Subversive Songs: Applying Strategies for Invention and Arrangement to Nontraditional Speech Texts

ERIC Educational Resources Information Center

Charlesworth, Dacia

2010-01-01

Invention deals with the content of a speech, arrangement involves placing the content in an order that is most strategic, style focuses on selecting linguistic devices, such as metaphor, to make the message more appealing, memory assists the speaker in delivering the message correctly, and delivery ideally enables great reception of the message.…
Effects of redundancy in the comparison of speech and pictorial displays in the cockpit environment.

PubMed

Byblow, W D

1990-06-01

Synthesised speech and pictorial displays were compared in a spatially compatible simulated cockpit environment. Messages of high or low levels of redundancy were presented to subjects in both modality conditions. Subjects responded to warnings presented in a warning-only condition and in a dual-task condition, in which a simulated flight task was performed with visual and manual input/output modalities. Because the amount of information presented in most real-world applications and experimental paradigms is quantifiably large with respect to present guidelines for the use of synthesised speech warnings, the low-redundancy condition was hypothesised to allow for better performance. Results showed that subjects respond quicker to messages of low redundancy in both modalities. It is suggested that speech messages with low-redundancy levels were effective in minimising message length and ensuring that messages did not overload the short-term memory required to process and maintain speech in memory. Manipulation of phrase structure was used to optimise message redundancy and enhance the conceptual compatibility of the message without increasing message length or imposing a perceptual cost or memory overload. The results also suggest that system response times were quicker when synthesised speech warnings were used. This result is consistent with predictions from multiple resource theory which states that the resources required for the perception of verbal warnings are different from those for the flight task. It is also suggested that the perception of a pictorial display requires the same resources used for the perception of the primary flight task. An alternative explanation is that pictorial displays impose a visual scanning cost which is responsible for decreased performance. Based on the findings reported here, it is suggested that speech displays be incorporated in a spatially compatible cockpit environment because they allow equal or better performance when compared with pictorial displays. More importantly, the amount of time that the operator must direct his vision away from information vital to the flight task is decreased.
"A Message for My Brother": The Vietnam Veterans' Memorial as Rhetorical Situation.

ERIC Educational Resources Information Center

Carlson, A. Cheree; Hocking, John E.

An examination of letters left at the Vietnam Veteran's Memorial in Washington, D.C. between November, 1984 and April, 1986 revealed that the memorial serves as a rhetorical situation that urges its visitors to eloquence. The memorial is an excellent proving ground for situational theory because the interaction of site and perception is vital to…

A weighted belief-propagation algorithm for estimating volume-related properties of random polytopes

NASA Astrophysics Data System (ADS)

Font-Clos, Francesc; Massucci, Francesco Alessandro; Pérez Castillo, Isaac

2012-11-01

In this work we introduce a novel weighted message-passing algorithm based on the cavity method for estimating volume-related properties of random polytopes, properties which are relevant in various research fields ranging from metabolic networks, to neural networks, to compressed sensing. We propose, as opposed to adopting the usual approach consisting in approximating the real-valued cavity marginal distributions by a few parameters, using an algorithm to faithfully represent the entire marginal distribution. We explain various alternatives for implementing the algorithm and benchmarking the theoretical findings by showing concrete applications to random polytopes. The results obtained with our approach are found to be in very good agreement with the estimates produced by the Hit-and-Run algorithm, known to produce uniform sampling.
Direct memory access transfer completion notification

DOEpatents

Chen, Dong; Giampapa, Mark E.; Heidelberger, Philip; Kumar, Sameer; Parker, Jeffrey J.; Steinmacher-Burow, Burkhard D.; Vranas, Pavlos

2010-07-27

Methods, compute nodes, and computer program products are provided for direct memory access (`DMA`) transfer completion notification. Embodiments include determining, by an origin DMA engine on an origin compute node, whether a data descriptor for an application message to be sent to a target compute node is currently in an injection first-in-first-out (`FIFO`) buffer in dependence upon a sequence number previously associated with the data descriptor, the total number of descriptors currently in the injection FIFO buffer, and the current sequence number for the newest data descriptor stored in the injection FIFO buffer; and notifying a processor core on the origin DMA engine that the message has been sent if the data descriptor for the message is not currently in the injection FIFO buffer.
Asbestos: Securing Untrusted Software with Interposition

DTIC Science & Technology

2005-09-01

consistent intelligible interfaces to different types of resource. Message-based operating systems, such as Accent, Amoeba, Chorus, L4 , Spring...control on self-authenticating capabilities, precluding policies that restrict delegation. L4 uses a strict hierarchy of interpositions, useful for...the OS de- sign space amenable to secure application construction. Similar effects might be possible with message-passing microkernels , or unwieldy
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets

PubMed Central

Plaku, Erion; Kavraki, Lydia E.

2009-01-01

High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
Remote direct memory access

DOEpatents

Archer, Charles J.; Blocksome, Michael A.

2012-12-11

Methods, parallel computers, and computer program products are disclosed for remote direct memory access. Embodiments include transmitting, from an origin DMA engine on an origin compute node to a plurality target DMA engines on target compute nodes, a request to send message, the request to send message specifying a data to be transferred from the origin DMA engine to data storage on each target compute node; receiving, by each target DMA engine on each target compute node, the request to send message; preparing, by each target DMA engine, to store data according to the data storage reference and the data length, including assigning a base storage address for the data storage reference; sending, by one or more of the target DMA engines, an acknowledgment message acknowledging that all the target DMA engines are prepared to receive a data transmission from the origin DMA engine; receiving, by the origin DMA engine, the acknowledgement message from the one or more of the target DMA engines; and transferring, by the origin DMA engine, data to data storage on each of the target compute nodes according to the data storage reference using a single direct put operation.
Communication Optimizations for a Wireless Distributed Prognostic Framework

NASA Technical Reports Server (NTRS)

Saha, Sankalita; Saha, Bhaskar; Goebel, Kai

2009-01-01

Distributed architecture for prognostics is an essential step in prognostic research in order to enable feasible real-time system health management. Communication overhead is an important design problem for such systems. In this paper we focus on communication issues faced in the distributed implementation of an important class of algorithms for prognostics - particle filters. In spite of being computation and memory intensive, particle filters lend well to distributed implementation except for one significant step - resampling. We propose new resampling scheme called parameterized resampling that attempts to reduce communication between collaborating nodes in a distributed wireless sensor network. Analysis and comparison with relevant resampling schemes is also presented. A battery health management system is used as a target application. A new resampling scheme for distributed implementation of particle filters has been discussed in this paper. Analysis and comparison of this new scheme with existing resampling schemes in the context for minimizing communication overhead have also been discussed. Our proposed new resampling scheme performs significantly better compared to other schemes by attempting to reduce both the communication message length as well as number total communication messages exchanged while not compromising prediction accuracy and precision. Future work will explore the effects of the new resampling scheme in the overall computational performance of the whole system as well as full implementation of the new schemes on the Sun SPOT devices. Exploring different network architectures for efficient communication is an importance future research direction as well.
A Memory Pacer for Improving Stimulus Generalization.

ERIC Educational Resources Information Center

Browning, Ellen R.

1983-01-01

The Memory Tracer, which provides prompts by playing prerecorded appropriate messages, was effective in maintaining a low rate of negative verbalizations by six adolescents with autism, schizophrenia, and severe behavior problems. (CL) 4B
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Hao-Qiang; anMey, Dieter; Hatay, Ferhat F.

2003-01-01

Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.
ADAPTIVE TETRAHEDRAL GRID REFINEMENT AND COARSENING IN MESSAGE-PASSING ENVIRONMENTS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hallberg, J.; Stagg, A.

2000-10-01

A grid refinement and coarsening scheme has been developed for tetrahedral and triangular grid-based calculations in message-passing environments. The element adaption scheme is based on an edge bisection of elements marked for refinement by an appropriate error indicator. Hash-table/linked-list data structures are used to store nodal and element formation. The grid along inter-processor boundaries is refined and coarsened consistently with the update of these data structures via MPI calls. The parallel adaption scheme has been applied to the solution of a transient, three-dimensional, nonlinear, groundwater flow problem. Timings indicate efficiency of the grid refinement process relative to the flow solvermore » calculations.« less
Balancing Contention and Synchronization on the Intel Paragon

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.; Nicol, David M.

1996-01-01

The Intel Paragon is a mesh-connected distributed memory parallel computer. It uses an oblivious and deterministic message routing algorithm: this permits us to develop highly optimized schedules for frequently needed communication patterns. The complete exchange is one such pattern. Several approaches are available for carrying it out on the mesh. We study an algorithm developed by Scott. This algorithm assumes that a communication link can carry one message at a time and that a node can only transmit one message at a time. It requires global synchronization to enforce a schedule of transmissions. Unfortunately global synchronization has substantial overhead on the Paragon. At the same time the powerful interconnection mechanism of this machine permits 2 or 3 messages to share a communication link with minor overhead. It can also overlap multiple message transmission from the same node to some extent. We develop a generalization of Scott's algorithm that executes complete exchange with a prescribed contention. Schedules that incur greater contention require fewer synchronization steps. This permits us to tradeoff contention against synchronization overhead. We describe the performance of this algorithm and compare it with Scott's original algorithm as well as with a naive algorithm that does not take interconnection structure into account. The Bounded contention algorithm is always better than Scott's algorithm and outperforms the naive algorithm for all but the smallest message sizes. The naive algorithm fails to work on meshes larger than 12 x 12. These results show that due consideration of processor interconnect and machine performance parameters is necessary to obtain peak performance from the Paragon and its successor mesh machines.
A self-defining hierarchical data system

NASA Technical Reports Server (NTRS)

Bailey, J.

1992-01-01

The Self-Defining Data System (SDS) is a system which allows the creation of self-defining hierarchical data structures in a form which allows the data to be moved between different machine architectures. Because the structures are self-defining they can be used for communication between independent modules in a distributed system. Unlike disk-based hierarchical data systems such as Starlink's HDS, SDS works entirely in memory and is very fast. Data structures are created and manipulated as internal dynamic structures in memory managed by SDS itself. A structure may then be exported into a caller supplied memory buffer in a defined external format. This structure can be written as a file or sent as a message to another machine. It remains static in structure until it is reimported into SDS. SDS is written in portable C and has been run on a number of different machine architectures. Structures are portable between machines with SDS looking after conversion of byte order, floating point format, and alignment. A Fortran callable version is also available for some machines.
Hyperswitch Communication Network Computer

NASA Technical Reports Server (NTRS)

Peterson, John C.; Chow, Edward T.; Priel, Moshe; Upchurch, Edwin T.

1993-01-01

Hyperswitch Communications Network (HCN) computer is prototype multiple-processor computer being developed. Incorporates improved version of hyperswitch communication network described in "Hyperswitch Network For Hypercube Computer" (NPO-16905). Designed to support high-level software and expansion of itself. HCN computer is message-passing, multiple-instruction/multiple-data computer offering significant advantages over older single-processor and bus-based multiple-processor computers, with respect to price/performance ratio, reliability, availability, and manufacturing. Design of HCN operating-system software provides flexible computing environment accommodating both parallel and distributed processing. Also achieves balance among following competing factors; performance in processing and communications, ease of use, and tolerance of (and recovery from) faults.
A Comparison of Automatic Parallelization Tools/Compilers on the SGI Origin 2000 Using the NAS Benchmarks

NASA Technical Reports Server (NTRS)

Saini, Subhash; Frumkin, Michael; Hribar, Michelle; Jin, Hao-Qiang; Waheed, Abdul; Yan, Jerry

1998-01-01

Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is extremely time consuming and costly, porting codes would ideally be automated by using some parallelization tools and compilers. In this paper, we compare the performance of the hand written NAB Parallel Benchmarks against three parallel versions generated with the help of tools and compilers: 1) CAPTools: an interactive computer aided parallelization too] that generates message passing code, 2) the Portland Group's HPF compiler and 3) using compiler directives with the native FORTAN77 compiler on the SGI Origin2000.
Application of decentralized cooperative problem solving in dynamic flexible scheduling

NASA Astrophysics Data System (ADS)

Guan, Zai-Lin; Lei, Ming; Wu, Bo; Wu, Ya; Yang, Shuzi

1995-08-01

The object of this study is to discuss an intelligent solution to the problem of task-allocation in shop floor scheduling. For this purpose, the technique of distributed artificial intelligence (DAI) is applied. Intelligent agents (IAs) are used to realize decentralized cooperation, and negotiation is realized by using message passing based on the contract net model. Multiple agents, such as manager agents, workcell agents, and workstation agents, make game-like decisions based on multiple criteria evaluations. This procedure of decentralized cooperative problem solving makes local scheduling possible. And by integrating such multiple local schedules, dynamic flexible scheduling for the whole shop floor production can be realized.
An MPA-IO interface to HPSS

NASA Technical Reports Server (NTRS)

Jones, Terry; Mark, Richard; Martin, Jeanne; May, John; Pierce, Elsie; Stanberry, Linda

1996-01-01

This paper describes an implementation of the proposed MPI-IO (Message Passing Interface - Input/Output) standard for parallel I/O. Our system uses third-party transfer to move data over an external network between the processors where it is used and the I/O devices where it resides. Data travels directly from source to destination, without the need for shuffling it among processors or funneling it through a central node. Our distributed server model lets multiple compute nodes share the burden of coordinating data transfers. The system is built on the High Performance Storage System (HPSS), and a prototype version runs on a Meiko CS-2 parallel computer.
Statistical mechanics of unsupervised feature learning in a restricted Boltzmann machine with binary synapses

NASA Astrophysics Data System (ADS)

Huang, Haiping

2017-05-01

Revealing hidden features in unlabeled data is called unsupervised feature learning, which plays an important role in pretraining a deep neural network. Here we provide a statistical mechanics analysis of the unsupervised learning in a restricted Boltzmann machine with binary synapses. A message passing equation to infer the hidden feature is derived, and furthermore, variants of this equation are analyzed. A statistical analysis by replica theory describes the thermodynamic properties of the model. Our analysis confirms an entropy crisis preceding the non-convergence of the message passing equation, suggesting a discontinuous phase transition as a key characteristic of the restricted Boltzmann machine. Continuous phase transition is also confirmed depending on the embedded feature strength in the data. The mean-field result under the replica symmetric assumption agrees with that obtained by running message passing algorithms on single instances of finite sizes. Interestingly, in an approximate Hopfield model, the entropy crisis is absent, and a continuous phase transition is observed instead. We also develop an iterative equation to infer the hyper-parameter (temperature) hidden in the data, which in physics corresponds to iteratively imposing Nishimori condition. Our study provides insights towards understanding the thermodynamic properties of the restricted Boltzmann machine learning, and moreover important theoretical basis to build simplified deep networks.
A knowledge base architecture for distributed knowledge agents

NASA Technical Reports Server (NTRS)

Riedesel, Joel; Walls, Bryan

1990-01-01

A tuple space based object oriented model for knowledge base representation and interpretation is presented. An architecture for managing distributed knowledge agents is then implemented within the model. The general model is based upon a database implementation of a tuple space. Objects are then defined as an additional layer upon the database. The tuple space may or may not be distributed depending upon the database implementation. A language for representing knowledge and inference strategy is defined whose implementation takes advantage of the tuple space. The general model may then be instantiated in many different forms, each of which may be a distinct knowledge agent. Knowledge agents may communicate using tuple space mechanisms as in the LINDA model as well as using more well known message passing mechanisms. An implementation of the model is presented describing strategies used to keep inference tractable without giving up expressivity. An example applied to a power management and distribution network for Space Station Freedom is given.
Distraction Effects of Smoking Cues in Antismoking Messages: Examining Resource Allocation to Message Processing as a Function of Smoking Cues and Argument Strength

PubMed Central

Lee, Sungkyoung; Cappella, Joseph N.

2014-01-01

Findings from previous studies on smoking cues and argument strength in antismoking messages have shown that the presence of smoking cues undermines the persuasiveness of antismoking public service announcements (PSAs) with weak arguments. This study conceptualized smoking cues (i.e., scenes showing smoking-related objects and behaviors) as stimuli motivationally relevant to the former smoker population and examined how smoking cues influence former smokers’ processing of antismoking PSAs. Specifically, by defining smoking cues and the strength of antismoking arguments in terms of resource allocation, this study examined former smokers’ recognition accuracy, memory strength, and memory judgment of visual (i.e., scenes excluding smoking cues) and audio information from antismoking PSAs. In line with previous findings, the results of the study showed that the presence of smoking cues undermined former smokers’ encoding of antismoking arguments, which includes the visual and audio information that compose the main content of antismoking messages. PMID:25477766
Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

DOEpatents

Blocksome, Michael A.; Mamidala, Amith R.

2013-09-03

Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

DOEpatents

Blocksome, Michael A; Mamidala, Amith R

2014-02-11

Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

Hyperswitch communication network

NASA Technical Reports Server (NTRS)

Peterson, J.; Pniel, M.; Upchurch, E.

1991-01-01

The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed.
Portability and Cross-Platform Performance of an MPI-Based Parallel Polygon Renderer

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.

1999-01-01

Visualizing the results of computations performed on large-scale parallel computers is a challenging problem, due to the size of the datasets involved. One approach is to perform the visualization and graphics operations in place, exploiting the available parallelism to obtain the necessary rendering performance. Over the past several years, we have been developing algorithms and software to support visualization applications on NASA's parallel supercomputers. Our results have been incorporated into a parallel polygon rendering system called PGL. PGL was initially developed on tightly-coupled distributed-memory message-passing systems, including Intel's iPSC/860 and Paragon, and IBM's SP2. Over the past year, we have ported it to a variety of additional platforms, including the HP Exemplar, SGI Origin2OOO, Cray T3E, and clusters of Sun workstations. In implementing PGL, we have had two primary goals: cross-platform portability and high performance. Portability is important because (1) our manpower resources are limited, making it difficult to develop and maintain multiple versions of the code, and (2) NASA's complement of parallel computing platforms is diverse and subject to frequent change. Performance is important in delivering adequate rendering rates for complex scenes and ensuring that parallel computing resources are used effectively. Unfortunately, these two goals are often at odds. In this paper we report on our experiences with portability and performance of the PGL polygon renderer across a range of parallel computing platforms.
Earth Observation

NASA Image and Video Library

2013-08-20

Earth observation taken during day pass by an Expedition 36 crew member on board the International Space Station (ISS). Per Twitter message: Looking southwest over northern Africa. Libya, Algeria, Niger.
Approximate message passing with restricted Boltzmann machine priors

NASA Astrophysics Data System (ADS)

Tramel, Eric W.; Drémeau, Angélique; Krzakala, Florent

2016-07-01

Approximate message passing (AMP) has been shown to be an excellent statistical approach to signal inference and compressed sensing problems. The AMP framework provides modularity in the choice of signal prior; here we propose a hierarchical form of the Gauss-Bernoulli prior which utilizes a restricted Boltzmann machine (RBM) trained on the signal support to push reconstruction performance beyond that of simple i.i.d. priors for signals whose support can be well represented by a trained binary RBM. We present and analyze two methods of RBM factorization and demonstrate how these affect signal reconstruction performance within our proposed algorithm. Finally, using the MNIST handwritten digit dataset, we show experimentally that using an RBM allows AMP to approach oracle-support performance.
Visualization Co-Processing of a CFD Simulation

NASA Technical Reports Server (NTRS)

Vaziri, Arsi

1999-01-01

OVERFLOW, a widely used CFD simulation code, is combined with a visualization system, pV3, to experiment with an environment for simulation/visualization co-processing on a SGI Origin 2000 computer(O2K) system. The shared memory version of the solver is used with the O2K 'pfa' preprocessor invoked to automatically discover parallelism in the source code. No other explicit parallelism is enabled. In order to study the scaling and performance of the visualization co-processing system, sample runs are made with different processor groups in the range of 1 to 254 processors. The data exchange between the visualization system and the simulation system is rapid enough for user interactivity when the problem size is small. This shared memory version of OVERFLOW, with minimal parallelization, does not scale well to an increasing number of available processors. The visualization task takes about 18 to 30% of the total processing time and does not appear to be a major contributor to the poor scaling. Improper load balancing and inter-processor communication overhead are contributors to this poor performance. Work is in progress which is aimed at obtaining improved parallel performance of the solver and removing the limitations of serial data transfer to pV3 by examining various parallelization/communication strategies, including the use of the explicit message passing.
SpaceWire Driver Software for Special DSPs

NASA Technical Reports Server (NTRS)

Clark, Douglas; Lux, James; Nishimoto, Kouji; Lang, Minh

2003-01-01

A computer program provides a high-level C-language interface to electronics circuitry that controls a SpaceWire interface in a system based on a space qualified version of the ADSP-21020 digital signal processor (DSP). SpaceWire is a spacecraft-oriented standard for packet-switching data-communication networks that comprise nodes connected through bidirectional digital serial links that utilize low-voltage differential signaling (LVDS). The software is tailored to the SMCS-332 application-specific integrated circuit (ASIC) (also available as the TSS901E), which provides three highspeed (150 Mbps) serial point-to-point links compliant with the proposed Institute of Electrical and Electronics Engineers (IEEE) Standard 1355.2 and equivalent European Space Agency (ESA) Standard ECSS-E-50-12. In the specific application of this software, the SpaceWire ASIC was combined with the DSP processor, memory, and control logic in a Multi-Chip Module DSP (MCM-DSP). The software is a collection of low-level driver routines that provide a simple message-passing application programming interface (API) for software running on the DSP. Routines are provided for interrupt-driven access to the two styles of interface provided by the SMCS: (1) the "word at a time" conventional host interface (HOCI); and (2) a higher performance "dual port memory" style interface (COMI).
Use of Hilbert Curves in Parallelized CUDA code: Interaction of Interstellar Atoms with the Heliosphere

NASA Astrophysics Data System (ADS)

Destefano, Anthony; Heerikhuisen, Jacob

2015-04-01

Fully 3D particle simulations can be a computationally and memory expensive task, especially when high resolution grid cells are required. The problem becomes further complicated when parallelization is needed. In this work we focus on computational methods to solve these difficulties. Hilbert curves are used to map the 3D particle space to the 1D contiguous memory space. This method of organization allows for minimized cache misses on the GPU as well as a sorted structure that is equivalent to an octal tree data structure. This type of sorted structure is attractive for uses in adaptive mesh implementations due to the logarithm search time. Implementations using the Message Passing Interface (MPI) library and NVIDIA's parallel computing platform CUDA will be compared, as MPI is commonly used on server nodes with many CPU's. We will also compare static grid structures with those of adaptive mesh structures. The physical test bed will be simulating heavy interstellar atoms interacting with a background plasma, the heliosphere, simulated from fully consistent coupled MHD/kinetic particle code. It is known that charge exchange is an important factor in space plasmas, specifically it modifies the structure of the heliosphere itself. We would like to thank the Alabama Supercomputer Authority for the use of their computational resources.
Intranode data communications in a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E

2014-01-07

Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a computer node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.
Intranode data communications in a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E

2013-07-23

Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a compute node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.
Signaling completion of a message transfer from an origin compute node to a target compute node

DOEpatents

Blocksome, Michael A [Rochester, MN

2011-02-15

Signaling completion of a message transfer from an origin node to a target node includes: sending, by an origin DMA engine, an RTS message, the RTS message specifying an application message for transfer to the target node from the origin node; receiving, by the origin DMA engine, a remote get message containing a data descriptor for the message and a completion notification descriptor, the completion notification descriptor specifying a local memory FIFO data transfer operation for transferring data locally on the origin node; inserting, by the origin DMA engine in an injection FIFO buffer, the data descriptor followed by the completion notification descriptor; transferring, by the origin DMA engine to the target node, the message in dependence upon the data descriptor; and notifying, by the origin DMA engine, the application that transfer of the message is complete in dependence upon the completion notification descriptor.
A New Random Walk for Replica Detection in WSNs.

PubMed

Aalsalem, Mohammed Y; Khan, Wazir Zada; Saad, N M; Hossain, Md Shohrab; Atiquzzaman, Mohammed; Khan, Muhammad Khurram

2016-01-01

Wireless Sensor Networks (WSNs) are vulnerable to Node Replication attacks or Clone attacks. Among all the existing clone detection protocols in WSNs, RAWL shows the most promising results by employing Simple Random Walk (SRW). More recently, RAND outperforms RAWL by incorporating Network Division with SRW. Both RAND and RAWL have used SRW for random selection of witness nodes which is problematic because of frequently revisiting the previously passed nodes that leads to longer delays, high expenditures of energy with lower probability that witness nodes intersect. To circumvent this problem, we propose to employ a new kind of constrained random walk, namely Single Stage Memory Random Walk and present a distributed technique called SSRWND (Single Stage Memory Random Walk with Network Division). In SSRWND, single stage memory random walk is combined with network division aiming to decrease the communication and memory costs while keeping the detection probability higher. Through intensive simulations it is verified that SSRWND guarantees higher witness node security with moderate communication and memory overheads. SSRWND is expedient for security oriented application fields of WSNs like military and medical.
A New Random Walk for Replica Detection in WSNs

PubMed Central

Aalsalem, Mohammed Y.; Saad, N. M.; Hossain, Md. Shohrab; Atiquzzaman, Mohammed; Khan, Muhammad Khurram

2016-01-01

Wireless Sensor Networks (WSNs) are vulnerable to Node Replication attacks or Clone attacks. Among all the existing clone detection protocols in WSNs, RAWL shows the most promising results by employing Simple Random Walk (SRW). More recently, RAND outperforms RAWL by incorporating Network Division with SRW. Both RAND and RAWL have used SRW for random selection of witness nodes which is problematic because of frequently revisiting the previously passed nodes that leads to longer delays, high expenditures of energy with lower probability that witness nodes intersect. To circumvent this problem, we propose to employ a new kind of constrained random walk, namely Single Stage Memory Random Walk and present a distributed technique called SSRWND (Single Stage Memory Random Walk with Network Division). In SSRWND, single stage memory random walk is combined with network division aiming to decrease the communication and memory costs while keeping the detection probability higher. Through intensive simulations it is verified that SSRWND guarantees higher witness node security with moderate communication and memory overheads. SSRWND is expedient for security oriented application fields of WSNs like military and medical. PMID:27409082
A multiprocessing architecture for real-time monitoring

NASA Technical Reports Server (NTRS)

Schmidt, James L.; Kao, Simon M.; Read, Jackson Y.; Weitzenkamp, Scott M.; Laffey, Thomas J.

1988-01-01

A multitasking architecture for performing real-time monitoring and analysis using knowledge-based problem solving techniques is described. To handle asynchronous inputs and perform in real time, the system consists of three or more distributed processes which run concurrently and communicate via a message passing scheme. The Data Management Process acquires, compresses, and routes the incoming sensor data to other processes. The Inference Process consists of a high performance inference engine that performs a real-time analysis on the state and health of the physical system. The I/O Process receives sensor data from the Data Management Process and status messages and recommendations from the Inference Process, updates its graphical displays in real time, and acts as the interface to the console operator. The distributed architecture has been interfaced to an actual spacecraft (NASA's Hubble Space Telescope) and is able to process the incoming telemetry in real-time (i.e., several hundred data changes per second). The system is being used in two locations for different purposes: (1) in Sunnyville, California at the Space Telescope Test Control Center it is used in the preflight testing of the vehicle; and (2) in Greenbelt, Maryland at NASA/Goddard it is being used on an experimental basis in flight operations for health and safety monitoring.
An Element-Based Concurrent Partitioner for Unstructured Finite Element Meshes

NASA Technical Reports Server (NTRS)

Ding, Hong Q.; Ferraro, Robert D.

1996-01-01

A concurrent partitioner for partitioning unstructured finite element meshes on distributed memory architectures is developed. The partitioner uses an element-based partitioning strategy. Its main advantage over the more conventional node-based partitioning strategy is its modular programming approach to the development of parallel applications. The partitioner first partitions element centroids using a recursive inertial bisection algorithm. Elements and nodes then migrate according to the partitioned centroids, using a data request communication template for unpredictable incoming messages. Our scalable implementation is contrasted to a non-scalable implementation which is a straightforward parallelization of a sequential partitioner.
Large-scale parallel lattice Boltzmann-cellular automaton model of two-dimensional dendritic growth

NASA Astrophysics Data System (ADS)

Jelinek, Bohumir; Eshraghi, Mohsen; Felicelli, Sergio; Peters, John F.

2014-03-01

An extremely scalable lattice Boltzmann (LB)-cellular automaton (CA) model for simulations of two-dimensional (2D) dendritic solidification under forced convection is presented. The model incorporates effects of phase change, solute diffusion, melt convection, and heat transport. The LB model represents the diffusion, convection, and heat transfer phenomena. The dendrite growth is driven by a difference between actual and equilibrium liquid composition at the solid-liquid interface. The CA technique is deployed to track the new interface cells. The computer program was parallelized using the Message Passing Interface (MPI) technique. Parallel scaling of the algorithm was studied and major scalability bottlenecks were identified. Efficiency loss attributable to the high memory bandwidth requirement of the algorithm was observed when using multiple cores per processor. Parallel writing of the output variables of interest was implemented in the binary Hierarchical Data Format 5 (HDF5) to improve the output performance, and to simplify visualization. Calculations were carried out in single precision arithmetic without significant loss in accuracy, resulting in 50% reduction of memory and computational time requirements. The presented solidification model shows a very good scalability up to centimeter size domains, including more than ten million of dendrites. Catalogue identifier: AEQZ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEQZ_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, UK Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 29,767 No. of bytes in distributed program, including test data, etc.: 3131,367 Distribution format: tar.gz Programming language: Fortran 90. Computer: Linux PC and clusters. Operating system: Linux. Has the code been vectorized or parallelized?: Yes. Program is parallelized using MPI. Number of processors used: 1-50,000 RAM: Memory requirements depend on the grid size Classification: 6.5, 7.7. External routines: MPI (http://www.mcs.anl.gov/research/projects/mpi/), HDF5 (http://www.hdfgroup.org/HDF5/) Nature of problem: Dendritic growth in undercooled Al-3 wt% Cu alloy melt under forced convection. Solution method: The lattice Boltzmann model solves the diffusion, convection, and heat transfer phenomena. The cellular automaton technique is deployed to track the solid/liquid interface. Restrictions: Heat transfer is calculated uncoupled from the fluid flow. Thermal diffusivity is constant. Unusual features: Novel technique, utilizing periodic duplication of a pre-grown “incubation” domain, is applied for the scaleup test. Running time: Running time varies from minutes to days depending on the domain size and number of computational cores.
Passing messages between biological networks to refine predicted interactions.

PubMed

Glass, Kimberly; Huttenhower, Curtis; Quackenbush, John; Yuan, Guo-Cheng

2013-01-01

Regulatory network reconstruction is a fundamental problem in computational biology. There are significant limitations to such reconstruction using individual datasets, and increasingly people attempt to construct networks using multiple, independent datasets obtained from complementary sources, but methods for this integration are lacking. We developed PANDA (Passing Attributes between Networks for Data Assimilation), a message-passing model using multiple sources of information to predict regulatory relationships, and used it to integrate protein-protein interaction, gene expression, and sequence motif data to reconstruct genome-wide, condition-specific regulatory networks in yeast as a model. The resulting networks were not only more accurate than those produced using individual data sets and other existing methods, but they also captured information regarding specific biological mechanisms and pathways that were missed using other methodologies. PANDA is scalable to higher eukaryotes, applicable to specific tissue or cell type data and conceptually generalizable to include a variety of regulatory, interaction, expression, and other genome-scale data. An implementation of the PANDA algorithm is available at www.sourceforge.net/projects/panda-net.
Flexible Language Constructs for Large Parallel Programs

DOE PAGES

Rosing, Matt; Schnabel, Robert

1994-01-01

The goal of the research described in this article is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (multiple instruction multiple data [MIMD]) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include single instruction multiple data (SIMD), single program multiple data (SPMD), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression ofmore » the variety of algorithms that occur in large scientific computations. In this article, we give an overview of a new language that combines many of these programming models in a clean manner. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. In this article, we give an overview of the language and discuss some of the critical implementation details.« less
Influence of emotional content and context on memory in mild Alzheimer's disease.

PubMed

Perrin, Margaux; Henaff, Marie-Anne; Padovan, Catherine; Faillenot, Isabelle; Merville, Adrien; Krolak-Salmon, Pierre

2012-01-01

Healthy subjects remember emotional stimuli better than neutral, as well as stimuli embedded in an emotional context. This better memory of emotional messages is linked to an amygdalo-hippocampal cooperation taking place in a larger fronto-temporal network particularly sensitive to pathological aging. Amygdala is mainly involved in gist memory of emotional messages. Whether emotional content or context enhances memory in mild Alzheimer's disease (AD) patients is still debated. The aim of the present study is to examine the influence of emotional content and emotional context on the memory in mild AD, and whether this influence is linked to amygdala volume. Fifteen patients affected by mild AD and 15 age-matched controls were submitted to series of negative, positive, and neutral pictures. Each series was embedded in an emotional or neutral sound context. At the end of each series, participants had to freely recall pictures, and answer questions about each picture. Amygdala volumes were measured on patient 3D-MRI scans. In the present study, emotional content significantly favored memory of gist but not of details in healthy elderly and in AD patients. Patients' amygdala volume was positively correlated to emotional content memory effect, implying a reduced memory benefit from emotional content when amygdala was atrophied. A positive context enhanced memory of pictures in healthy elderly, but not in AD, corroborating early fronto-temporal dysfunction and early working memory limitation in this disease.
Defining Audio/Video Redundancy from a Limited Capacity Information Processing Perspective.

ERIC Educational Resources Information Center

Lang, Annie

1995-01-01

Investigates whether audio/video redundancy improves memory for television messages. Suggests a theoretical framework for classifying previous work and reinterpreting the results. Suggests general support for the notion that redundancy levels affect the capacity requirements of the message, which impact differentially on audio or visual…
Talking Wheelchair

NASA Technical Reports Server (NTRS)

1981-01-01

Communication is made possible for disabled individuals by means of an electronic system, developed at Stanford University's School of Medicine, which produces highly intelligible synthesized speech. Familiarly known as the "talking wheelchair" and formally as the Versatile Portable Speech Prosthesis (VPSP). Wheelchair mounted system consists of a word processor, a video screen, a voice synthesizer and a computer program which instructs the synthesizer how to produce intelligible sounds in response to user commands. Computer's memory contains 925 words plus a number of common phrases and questions. Memory can also store several thousand other words of the user's choice. Message units are selected by operating a simple switch, joystick or keyboard. Completed message appears on the video screen, then user activates speech synthesizer, which generates a voice with a somewhat mechanical tone. With the keyboard, an experienced user can construct messages as rapidly as 30 words per minute.

Xyce parallel electronic simulator : users' guide.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.

2011-05-01

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-artmore » algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a unique electrical simulation capability, designed to meet the unique needs of the laboratory.« less
Randomized Dynamic Mode Decomposition

NASA Astrophysics Data System (ADS)

Erichson, N. Benjamin; Brunton, Steven L.; Kutz, J. Nathan

2017-11-01

The dynamic mode decomposition (DMD) is an equation-free, data-driven matrix decomposition that is capable of providing accurate reconstructions of spatio-temporal coherent structures arising in dynamical systems. We present randomized algorithms to compute the near-optimal low-rank dynamic mode decomposition for massive datasets. Randomized algorithms are simple, accurate and able to ease the computational challenges arising with `big data'. Moreover, randomized algorithms are amenable to modern parallel and distributed computing. The idea is to derive a smaller matrix from the high-dimensional input data matrix using randomness as a computational strategy. Then, the dynamic modes and eigenvalues are accurately learned from this smaller representation of the data, whereby the approximation quality can be controlled via oversampling and power iterations. Here, we present randomized DMD algorithms that are categorized by how many passes the algorithm takes through the data. Specifically, the single-pass randomized DMD does not require data to be stored for subsequent passes. Thus, it is possible to approximately decompose massive fluid flows (stored out of core memory, or not stored at all) using single-pass algorithms, which is infeasible with traditional DMD algorithms.
Earth observation taken by the Expedition 46 crew

NASA Image and Video Library

2016-01-23

ISS046e021993 (01/23/2016) --- Earth observation of the coast of Oman taken during a night pass by the Expedition 46 crew aboard the International Space Station. NASA astronaut Tim Kopra tweeted this image out with this message: "Passing over the Gulf of #Oman at night -- city lights of #Muscat #Dubai #AbuDhabi and #Doha in the distance".
Optimized collectives using a DMA on a parallel computer

DOEpatents

Chen, Dong [Croton On Hudson, NY; Gabor, Dozsa [Ardsley, NY; Giampapa, Mark E [Irvington, NY; Heidelberger,; Phillip, [Cortlandt Manor, NY

2011-02-08

Optimizing collective operations using direct memory access controller on a parallel computer, in one aspect, may comprise establishing a byte counter associated with a direct memory access controller for each submessage in a message. The byte counter includes at least a base address of memory and a byte count associated with a submessage. A byte counter associated with a submessage is monitored to determine whether at least a block of data of the submessage has been received. The block of data has a predetermined size, for example, a number of bytes. The block is processed when the block has been fully received, for example, when the byte count indicates all bytes of the block have been received. The monitoring and processing may continue for all blocks in all submessages in the message.
Everyday memory impairment in patients with temporal lobe epilepsy caused by hippocampal sclerosis.

PubMed

Rzezak, Patrícia; Lima, Ellen Marise; Gargaro, Ana Carolina; Coimbra, Erica; de Vincentiis, Silvia; Velasco, Tonicarlo Rodrigues; Leite, João Pereira; Busatto, Geraldo F; Valente, Kette D

2017-04-01

Patients with temporal lobe epilepsy caused by hippocampal sclerosis (TLE-HS) have episodic memory impairment. Memory has rarely been evaluated using an ecologic measure, even though performance on these tests is more related to patients' memory complaints. We aimed to measure everyday memory of patients with TLE-HS to age- and gender-matched controls. We evaluated 31 patients with TLE-HS and 34 healthy controls, without epilepsy and psychiatric disorders, using the Rivermead Behavioral Memory Test (RBMT), Visual Reproduction (WMS-III) and Logical Memory (WMS-III). We evaluated the impact of clinical variables such as the age of onset, epilepsy duration, AED use, history of status epilepticus, and seizure frequency on everyday memory. Statistical analyses were performed using MANCOVA with years of education as a confounding factor. Patients showed worse performance than controls on traditional memory tests and in the overall score of RBMT. Patients had more difficulties to recall names, a hidden belonging, to deliver a message, object recognition, to remember a story full of details, a previously presented short route, and in time and space orientation. Clinical epilepsy variables were not associated with RBMT performance. Memory span and working memory were correlated with worse performance on RBMT. Patients with TLE-HS demonstrated deficits in everyday memory functions. A standard neuropsychological battery, designed to assess episodic memory, would not evaluate these impairments. Impairment in recalling names, routes, stories, messages, and space/time disorientation can adversely impact social adaptation, and we must consider these ecologic measures with greater attention in the neuropsychological evaluation of patients with memory complaints. Copyright © 2017 Elsevier Inc. All rights reserved.
Statistical mechanics of budget-constrained auctions

NASA Astrophysics Data System (ADS)

Altarelli, F.; Braunstein, A.; Realpe-Gomez, J.; Zecchina, R.

2009-07-01

Finding the optimal assignment in budget-constrained auctions is a combinatorial optimization problem with many important applications, a notable example being in the sale of advertisement space by search engines (in this context the problem is often referred to as the off-line AdWords problem). On the basis of the cavity method of statistical mechanics, we introduce a message-passing algorithm that is capable of solving efficiently random instances of the problem extracted from a natural distribution, and we derive from its properties the phase diagram of the problem. As the control parameter (average value of the budgets) is varied, we find two phase transitions delimiting a region in which long-range correlations arise.
Mean-field message-passing equations in the Hopfield model and its generalizations

NASA Astrophysics Data System (ADS)

Mézard, Marc

2017-02-01

Motivated by recent progress in using restricted Boltzmann machines as preprocessing algorithms for deep neural network, we revisit the mean-field equations [belief-propagation and Thouless-Anderson Palmer (TAP) equations] in the best understood of such machines, namely the Hopfield model of neural networks, and we explicit how they can be used as iterative message-passing algorithms, providing a fast method to compute the local polarizations of neurons. In the "retrieval phase", where neurons polarize in the direction of one memorized pattern, we point out a major difference between the belief propagation and TAP equations: The set of belief propagation equations depends on the pattern which is retrieved, while one can use a unique set of TAP equations. This makes the latter method much better suited for applications in the learning process of restricted Boltzmann machines. In the case where the patterns memorized in the Hopfield model are not independent, but are correlated through a combinatorial structure, we show that the TAP equations have to be modified. This modification can be seen either as an alteration of the reaction term in TAP equations or, more interestingly, as the consequence of message passing on a graphical model with several hidden layers, where the number of hidden layers depends on the depth of the correlations in the memorized patterns. This layered structure is actually necessary when one deals with more general restricted Boltzmann machines.
Earth Observation

NASA Image and Video Library

2013-08-03

Earth observation taken during day pass by an Expedition 36 crew member on board the International Space Station (ISS). Per Twitter message: Perhaps a dandelion losing its seeds in the wind? Love clouds!
Paralysis

MedlinePlus

Paralysis is the loss of muscle function in part of your body. It happens when something goes ... way messages pass between your brain and muscles. Paralysis can be complete or partial. It can occur ...
Advanced Numerical Techniques of Performance Evaluation. Volume 2

DTIC Science & Technology

1990-06-01

multiprocessor environment. This factor is determined by the overhead of the primitives available in the system ( semaphore , monitor , or message... semaphore , monitor , or message passing primitives ) and U the programming ability of the user who implements the simulation. " t,: the sequential...Warp Operating System . i Pro" lftevcnth ACM Symposum on Operating Systems Princlplcs, pages 77 9:3, Auslin, TX, Nov wicr 1987. ACM. [121 D.R. Jefferson
Benchmarking hypercube hardware and software

NASA Technical Reports Server (NTRS)

Grunwald, Dirk C.; Reed, Daniel A.

1986-01-01

It was long a truism in computer systems design that balanced systems achieve the best performance. Message passing parallel processors are no different. To quantify the balance of a hypercube design, an experimental methodology was developed and the associated suite of benchmarks was applied to several existing hypercubes. The benchmark suite includes tests of both processor speed in the absence of internode communication and message transmission speed as a function of communication patterns.
Data transfer using complete bipartite graph

NASA Astrophysics Data System (ADS)

Chandrasekaran, V. M.; Praba, B.; Manimaran, A.; Kailash, G.

2017-11-01

Information exchange extent is an estimation of the amount of information sent between two focuses on a framework in a given time period. It is an extremely significant perception in present world. There are many ways of message passing in the present situations. Some of them are through encryption, decryption, by using complete bipartite graph. In this paper, we recommend a method for communication using messages through encryption of a complete bipartite graph.
Accelerating three-dimensional FDTD calculations on GPU clusters for electromagnetic field simulation.

PubMed

Nagaoka, Tomoaki; Watanabe, Soichi

2012-01-01

Electromagnetic simulation with anatomically realistic computational human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes. The seven GPU boards (NVIDIA Tesla C2070) are mounted on each node. We examined the performance of the FDTD calculation on multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU clusters is faster than that on a multi-GPU (a single workstation), and we also found that the GPU cluster system calculate faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform the large-scale FDTD calculation because were able to use GPU memory of over 100 GB.
Load Balancing Strategies for Multiphase Flows on Structured Grids

NASA Astrophysics Data System (ADS)

Olshefski, Kristopher; Owkes, Mark

2017-11-01

The computation time required to perform large simulations of complex systems is currently one of the leading bottlenecks of computational research. Parallelization allows multiple processing cores to perform calculations simultaneously and reduces computational times. However, load imbalances between processors waste computing resources as processors wait for others to complete imbalanced tasks. In multiphase flows, these imbalances arise due to the additional computational effort required at the gas-liquid interface. However, many current load balancing schemes are only designed for unstructured grid applications. The purpose of this research is to develop a load balancing strategy while maintaining the simplicity of a structured grid. Several approaches are investigated including brute force oversubscription, node oversubscription through Message Passing Interface (MPI) commands, and shared memory load balancing using OpenMP. Each of these strategies are tested with a simple one-dimensional model prior to implementation into the three-dimensional NGA code. Current results show load balancing will reduce computational time by at least 30%.
Spectral Element Method for the Simulation of Unsteady Compressible Flows

NASA Technical Reports Server (NTRS)

Diosady, Laslo Tibor; Murman, Scott M.

2013-01-01

This work uses a discontinuous-Galerkin spectral-element method (DGSEM) to solve the compressible Navier-Stokes equations [1{3]. The inviscid ux is computed using the approximate Riemann solver of Roe [4]. The viscous fluxes are computed using the second form of Bassi and Rebay (BR2) [5] in a manner consistent with the spectral-element approximation. The method of lines with the classical 4th-order explicit Runge-Kutta scheme is used for time integration. Results for polynomial orders up to p = 15 (16th order) are presented. The code is parallelized using the Message Passing Interface (MPI). The computations presented in this work are performed using the Sandy Bridge nodes of the NASA Pleiades supercomputer at NASA Ames Research Center. Each Sandy Bridge node consists of 2 eight-core Intel Xeon E5-2670 processors with a clock speed of 2.6Ghz and 2GB per core memory. On a Sandy Bridge node the Tau Benchmark [6] runs in a time of 7.6s.
Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks

NASA Technical Reports Server (NTRS)

Jin, Haoqiang; VanderWijngaart, Rob F.

2003-01-01

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in bench-marks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.
Employing multi-GPU power for molecular dynamics simulation: an extension of GALAMOST

NASA Astrophysics Data System (ADS)

Zhu, You-Liang; Pan, Deng; Li, Zhan-Wei; Liu, Hong; Qian, Hu-Jun; Zhao, Yang; Lu, Zhong-Yuan; Sun, Zhao-Yan

2018-04-01

We describe the algorithm of employing multi-GPU power on the basis of Message Passing Interface (MPI) domain decomposition in a molecular dynamics code, GALAMOST, which is designed for the coarse-grained simulation of soft matters. The code of multi-GPU version is developed based on our previous single-GPU version. In multi-GPU runs, one GPU takes charge of one domain and runs single-GPU code path. The communication between neighbouring domains takes a similar algorithm of CPU-based code of LAMMPS, but is optimised specifically for GPUs. We employ a memory-saving design which can enlarge maximum system size at the same device condition. An optimisation algorithm is employed to prolong the update period of neighbour list. We demonstrate good performance of multi-GPU runs on the simulation of Lennard-Jones liquid, dissipative particle dynamics liquid, polymer and nanoparticle composite, and two-patch particles on workstation. A good scaling of many nodes on cluster for two-patch particles is presented.
The influence of banner advertisements on attention and memory: human faces with averted gaze can enhance advertising effectiveness

PubMed Central

Sajjacholapunt, Pitch; Ball, Linden J.

2014-01-01

Research suggests that banner advertisements used in online marketing are often overlooked, especially when positioned horizontally on webpages. Such inattention invariably gives rise to an inability to remember advertising brands and messages, undermining the effectiveness of this marketing method. Recent interest has focused on whether human faces within banner advertisements can increase attention to the information they contain, since the gaze cues conveyed by faces can influence where observers look. We report an experiment that investigated the efficacy of faces located in banner advertisements to enhance the attentional processing and memorability of banner contents. We tracked participants' eye movements when they examined webpages containing either bottom-right vertical banners or bottom-center horizontal banners. We also manipulated facial information such that banners either contained no face, a face with mutual gaze or a face with averted gaze. We additionally assessed people's memories for brands and advertising messages. Results indicated that relative to other conditions, the condition involving faces with averted gaze increased attention to the banner overall, as well as to the advertising text and product. Memorability of the brand and advertising message was also enhanced. Conversely, in the condition involving faces with mutual gaze, the focus of attention was localized more on the face region rather than on the text or product, weakening any memory benefits for the brand and advertising message. This detrimental impact of mutual gaze on attention to advertised products was especially marked for vertical banners. These results demonstrate that the inclusion of human faces with averted gaze in banner advertisements provides a promising means for marketers to increase the attention paid to such adverts, thereby enhancing memory for advertising information. PMID:24624104
The influence of banner advertisements on attention and memory: human faces with averted gaze can enhance advertising effectiveness.

PubMed

Sajjacholapunt, Pitch; Ball, Linden J

2014-01-01

Research suggests that banner advertisements used in online marketing are often overlooked, especially when positioned horizontally on webpages. Such inattention invariably gives rise to an inability to remember advertising brands and messages, undermining the effectiveness of this marketing method. Recent interest has focused on whether human faces within banner advertisements can increase attention to the information they contain, since the gaze cues conveyed by faces can influence where observers look. We report an experiment that investigated the efficacy of faces located in banner advertisements to enhance the attentional processing and memorability of banner contents. We tracked participants' eye movements when they examined webpages containing either bottom-right vertical banners or bottom-center horizontal banners. We also manipulated facial information such that banners either contained no face, a face with mutual gaze or a face with averted gaze. We additionally assessed people's memories for brands and advertising messages. Results indicated that relative to other conditions, the condition involving faces with averted gaze increased attention to the banner overall, as well as to the advertising text and product. Memorability of the brand and advertising message was also enhanced. Conversely, in the condition involving faces with mutual gaze, the focus of attention was localized more on the face region rather than on the text or product, weakening any memory benefits for the brand and advertising message. This detrimental impact of mutual gaze on attention to advertised products was especially marked for vertical banners. These results demonstrate that the inclusion of human faces with averted gaze in banner advertisements provides a promising means for marketers to increase the attention paid to such adverts, thereby enhancing memory for advertising information.
Statistical physics of hard combinatorial optimization: Vertex cover problem

NASA Astrophysics Data System (ADS)

Zhao, Jin-Hua; Zhou, Hai-Jun

2014-07-01

Typical-case computation complexity is a research topic at the boundary of computer science, applied mathematics, and statistical physics. In the last twenty years, the replica-symmetry-breaking mean field theory of spin glasses and the associated message-passing algorithms have greatly deepened our understanding of typical-case computation complexity. In this paper, we use the vertex cover problem, a basic nondeterministic-polynomial (NP)-complete combinatorial optimization problem of wide application, as an example to introduce the statistical physical methods and algorithms. We do not go into the technical details but emphasize mainly the intuitive physical meanings of the message-passing equations. A nonfamiliar reader shall be able to understand to a large extent the physics behind the mean field approaches and to adjust the mean field methods in solving other optimization problems.

A Message Passing Approach to Side Chain Positioning with Applications in Protein Docking Refinement *

PubMed Central

Moghadasi, Mohammad; Kozakov, Dima; Mamonov, Artem B.; Vakili, Pirooz; Vajda, Sandor; Paschalidis, Ioannis Ch.

2013-01-01

We introduce a message-passing algorithm to solve the Side Chain Positioning (SCP) problem. SCP is a crucial component of protein docking refinement, which is a key step of an important class of problems in computational structural biology called protein docking. We model SCP as a combinatorial optimization problem and formulate it as a Maximum Weighted Independent Set (MWIS) problem. We then employ a modified and convergent belief-propagation algorithm to solve a relaxation of MWIS and develop randomized estimation heuristics that use the relaxed solution to obtain an effective MWIS feasible solution. Using a benchmark set of protein complexes we demonstrate that our approach leads to more accurate docking predictions compared to a baseline algorithm that does not solve the SCP. PMID:23515575
Multi-partitioning for ADI-schemes on message passing architectures

NASA Technical Reports Server (NTRS)

Vanderwijngaart, Rob F.

1994-01-01

A kind of discrete-operator splitting called Alternating Direction Implicit (ADI) has been found to be useful in simulating fluid flow problems. In particular, it is being used to study the effects of hot exhaust jets from high performance aircraft on landing surfaces. Decomposition techniques that minimize load imbalance and message-passing frequency are described. Three strategies that are investigated for implementing the NAS Scalar Penta-diagonal Parallel Benchmark (SP) are transposition, pipelined Gaussian elimination, and multipartitioning. The multipartitioning strategy, which was used on Ethernet, was found to be the most efficient, although it was considered only a moderate success because of Ethernet's limited communication properties. The efficiency derived largely from the coarse granularity of the strategy, which reduced latencies and allowed overlap of communication and computation.
Ultrascalable petaflop parallel supercomputer

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Chiu, George [Cross River, NY; Cipolla, Thomas M [Katonah, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Hall, Shawn [Pleasantville, NY; Haring, Rudolf A [Cortlandt Manor, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kopcsay, Gerard V [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Salapura, Valentina [Chappaqua, NY; Sugavanam, Krishnan [Mahopac, NY; Takken, Todd [Brewster, NY

2010-07-20

A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.
Humor in print health advertisements: enhanced attention, privileged recognition, and persuasiveness of preventive messages.

PubMed

Blanc, Nathalie; Brigaud, Emmanuelle

2014-01-01

This study tested the effect of humor in one particular type of print advertisement: the preventive health ads for three topics (alcohol, tobacco, obesity). Previous research using commercial ads demonstrated that individuals' attention is spontaneously attracted by humor, leading to a memory advantage for humorous information over nonhumorous information. Two experiments investigated whether the positive effect of humor can occur with preventive health ads. In Experiment 1, participants observed humorous and nonhumorous health ads while their viewing times were recorded. In Experiment 2, to compare humorous and nonhumorous ads, the memory of health messages was assessed through a recognition task and a convincing score was collected. The results confirmed that, compared to nonhumorous health ads, those using humor received prolonged attention, were judged more convincing, and their messages were better recognized. Overall, these findings suggest that humor can be of use in preventive health communication.
Performance Evaluation of Communication Software Systems for Distributed Computing

NASA Technical Reports Server (NTRS)

Fatoohi, Rod

1996-01-01

In recent years there has been an increasing interest in object-oriented distributed computing since it is better quipped to deal with complex systems while providing extensibility, maintainability, and reusability. At the same time, several new high-speed network technologies have emerged for local and wide area networks. However, the performance of networking software is not improving as fast as the networking hardware and the workstation microprocessors. This paper gives an overview and evaluates the performance of the Common Object Request Broker Architecture (CORBA) standard in a distributed computing environment at NASA Ames Research Center. The environment consists of two testbeds of SGI workstations connected by four networks: Ethernet, FDDI, HiPPI, and ATM. The performance results for three communication software systems are presented, analyzed and compared. These systems are: BSD socket programming interface, IONA's Orbix, an implementation of the CORBA specification, and the PVM message passing library. The results show that high-level communication interfaces, such as CORBA and PVM, can achieve reasonable performance under certain conditions.
The AI Bus architecture for distributed knowledge-based systems

NASA Technical Reports Server (NTRS)

Schultz, Roger D.; Stobie, Iain

1991-01-01

The AI Bus architecture is layered, distributed object oriented framework developed to support the requirements of advanced technology programs for an order of magnitude improvement in software costs. The consequent need for highly autonomous computer systems, adaptable to new technology advances over a long lifespan, led to the design of an open architecture and toolbox for building large scale, robust, production quality systems. The AI Bus accommodates a mix of knowledge based and conventional components, running on heterogeneous, distributed real world and testbed environment. The concepts and design is described of the AI Bus architecture and its current implementation status as a Unix C++ library or reusable objects. Each high level semiautonomous agent process consists of a number of knowledge sources together with interagent communication mechanisms based on shared blackboards and message passing acquaintances. Standard interfaces and protocols are followed for combining and validating subsystems. Dynamic probes or demons provide an event driven means for providing active objects with shared access to resources, and each other, while not violating their security.
Message Passing on GPUs

NASA Astrophysics Data System (ADS)

Stuart, J. A.

2011-12-01

This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors, and more specifically GPUs. As a case study, we design and implement the ``DCGN'' API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU-based MPI implementations while providing fully-dynamic communication.
Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blocksome, Michael A.; Mamidala, Amith R.

2013-09-03

Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segmentmore » of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.« less
Regularization with numerical extrapolation for finite and UV-divergent multi-loop integrals

NASA Astrophysics Data System (ADS)

de Doncker, E.; Yuasa, F.; Kato, K.; Ishikawa, T.; Kapenga, J.; Olagbemi, O.

2018-03-01

We give numerical integration results for Feynman loop diagrams such as those covered by Laporta (2000) and by Baikov and Chetyrkin (2010), and which may give rise to loop integrals with UV singularities. We explore automatic adaptive integration using multivariate techniques from the PARINT package for multivariate integration, as well as iterated integration with programs from the QUADPACK package, and a trapezoidal method based on a double exponential transformation. PARINT is layered over MPI (Message Passing Interface), and incorporates advanced parallel/distributed techniques including load balancing among processes that may be distributed over a cluster or a network/grid of nodes. Results are included for 2-loop vertex and box diagrams and for sets of 2-, 3- and 4-loop self-energy diagrams with or without UV terms. Numerical regularization of integrals with singular terms is achieved by linear and non-linear extrapolation methods.
Altered retrieval of melodic information in congenital amusia: insights from dynamic causal modeling of MEG data.

PubMed

Albouy, Philippe; Mattout, Jérémie; Sanchez, Gaëtan; Tillmann, Barbara; Caclin, Anne

2015-01-01

Congenital amusia is a neuro-developmental disorder that primarily manifests as a difficulty in the perception and memory of pitch-based materials, including music. Recent findings have shown that the amusic brain exhibits altered functioning of a fronto-temporal network during pitch perception and short-term memory. Within this network, during the encoding of melodies, a decreased right backward frontal-to-temporal connectivity was reported in amusia, along with an abnormal connectivity within and between auditory cortices. The present study investigated whether connectivity patterns between these regions were affected during the short-term memory retrieval of melodies. Amusics and controls had to indicate whether sequences of six tones that were presented in pairs were the same or different. When melodies were different only one tone changed in the second melody. Brain responses to the changed tone in "Different" trials and to its equivalent (original) tone in "Same" trials were compared between groups using Dynamic Causal Modeling (DCM). DCM results confirmed that congenital amusia is characterized by an altered effective connectivity within and between the two auditory cortices during sound processing. Furthermore, right temporal-to-frontal message passing was altered in comparison to controls, with notably an increase in "Same" trials. An additional analysis in control participants emphasized that the detection of an unexpected event in the typically functioning brain is supported by right fronto-temporal connections. The results can be interpreted in a predictive coding framework as reflecting an abnormal prediction error sent by temporal auditory regions towards frontal areas in the amusic brain.
Memory Transformation Enhances Reinforcement Learning in Dynamic Environments.

PubMed

Santoro, Adam; Frankland, Paul W; Richards, Blake A

2016-11-30

Over the course of systems consolidation, there is a switch from a reliance on detailed episodic memories to generalized schematic memories. This switch is sometimes referred to as "memory transformation." Here we demonstrate a previously unappreciated benefit of memory transformation, namely, its ability to enhance reinforcement learning in a dynamic environment. We developed a neural network that is trained to find rewards in a foraging task where reward locations are continuously changing. The network can use memories for specific locations (episodic memories) and statistical patterns of locations (schematic memories) to guide its search. We find that switching from an episodic to a schematic strategy over time leads to enhanced performance due to the tendency for the reward location to be highly correlated with itself in the short-term, but regress to a stable distribution in the long-term. We also show that the statistics of the environment determine the optimal utilization of both types of memory. Our work recasts the theoretical question of why memory transformation occurs, shifting the focus from the avoidance of memory interference toward the enhancement of reinforcement learning across multiple timescales. As time passes, memories transform from a highly detailed state to a more gist-like state, in a process called "memory transformation." Theories of memory transformation speak to its advantages in terms of reducing memory interference, increasing memory robustness, and building models of the environment. However, the role of memory transformation from the perspective of an agent that continuously acts and receives reward in its environment is not well explored. In this work, we demonstrate a view of memory transformation that defines it as a way of optimizing behavior across multiple timescales. Copyright © 2016 the authors 0270-6474/16/3612228-15$15.00/0.
Words with Friends: Effects of Associative and Semantic Relationships on Subject-Verb Agreement Errors during Sentence Production

ERIC Educational Resources Information Center

Penta, Darrell J.

2017-01-01

The sentence production system transforms preverbal messages in the mind of a speaker into coherent grammatical utterances. During this process, which unfolds rapidly, the system has to link meaning information from the speaker's message to appropriate lexical and grammatical information from the speaker's memory. It usually does so with fluency…
Embedded real-time operating system micro kernel design

NASA Astrophysics Data System (ADS)

Cheng, Xiao-hui; Li, Ming-qiang; Wang, Xin-zheng

2005-12-01

Embedded systems usually require a real-time character. Base on an 8051 microcontroller, an embedded real-time operating system micro kernel is proposed consisting of six parts, including a critical section process, task scheduling, interruption handle, semaphore and message mailbox communication, clock managent and memory managent. Distributed CPU and other resources are among tasks rationally according to the importance and urgency. The design proposed here provides the position, definition, function and principle of micro kernel. The kernel runs on the platform of an ATMEL AT89C51 microcontroller. Simulation results prove that the designed micro kernel is stable and reliable and has quick response while operating in an application system.
Passing Messages between Biological Networks to Refine Predicted Interactions

PubMed Central

Glass, Kimberly; Huttenhower, Curtis; Quackenbush, John; Yuan, Guo-Cheng

2013-01-01

Regulatory network reconstruction is a fundamental problem in computational biology. There are significant limitations to such reconstruction using individual datasets, and increasingly people attempt to construct networks using multiple, independent datasets obtained from complementary sources, but methods for this integration are lacking. We developed PANDA (Passing Attributes between Networks for Data Assimilation), a message-passing model using multiple sources of information to predict regulatory relationships, and used it to integrate protein-protein interaction, gene expression, and sequence motif data to reconstruct genome-wide, condition-specific regulatory networks in yeast as a model. The resulting networks were not only more accurate than those produced using individual data sets and other existing methods, but they also captured information regarding specific biological mechanisms and pathways that were missed using other methodologies. PANDA is scalable to higher eukaryotes, applicable to specific tissue or cell type data and conceptually generalizable to include a variety of regulatory, interaction, expression, and other genome-scale data. An implementation of the PANDA algorithm is available at www.sourceforge.net/projects/panda-net. PMID:23741402
Applications of Probabilistic Combiners on Linear Feedback Shift Register Sequences

DTIC Science & Technology

2016-12-01

on the resulting output strings show a drastic increase in complexity, while simultaneously passing the stringent randomness tests required by the...a three-variable function. Our tests on the resulting output strings show a drastic increase in complex- ity, while simultaneously passing the...10001101 01000010 11101001 Decryption of a message that has been encrypted using bitwise XOR is quite simple. Since each bit is its own additive inverse
Statistical mechanics of complex neural systems and high dimensional data

NASA Astrophysics Data System (ADS)

Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya

2013-03-01

Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks.
A parallel solver for huge dense linear systems

NASA Astrophysics Data System (ADS)

Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.

2011-11-01

HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending of the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summaryProgram title: Huge Dense System Solver (HDSS) Catalogue identifier: AEHU_v1_1 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 87 062 No. of bytes in distributed program, including test data, etc.: 1 069 110 Distribution format: tar.gz Programming language: Fortran90, C Computer: Parallel architectures: multiprocessors, computer clusters Operating system: Linux/Unix Has the code been vectorized or parallelized?: Yes, includes MPI primitives. RAM: Tested for up to 190 GB Classification: 6.5 External routines: MPI ( http://www.mpi-forum.org/), BLAS ( http://www.netlib.org/blas/), PLAPACK ( http://www.cs.utexas.edu/~plapack/), POOCLAPACK ( ftp://ftp.cs.utexas.edu/pub/rvdg/PLAPACK/pooclapack.ps) (code for PLAPACK and POOCLAPACK is included in the distribution). Catalogue identifier of previous version: AEHU_v1_0 Journal reference of previous version: Comput. Phys. Comm. 182 (2011) 533 Does the new version supersede the previous version?: Yes Nature of problem: Huge scale dense systems of linear equations, Ax=B, beyond standard LAPACK capabilities. Solution method: The linear systems are solved by means of parallelized routines based on the LU factorization, using efficient secondary storage algorithms when the available main memory is insufficient. Reasons for new version: In many applications we need to guarantee a high accuracy in the solution of very large linear systems and we can do it by using double-precision arithmetic. Summary of revisions: Version 1.1 Can be used to solve linear systems using double-precision arithmetic. New version of the initialization routine. The user can choose the kind of arithmetic and the values of several parameters of the environment. Running time: About 5 hours to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors using double-precision arithmetic on an eight-node commodity cluster with a total of 64 Intel cores.
Deceived, Disgusted, and Defensive: Motivated Processing of Anti-Tobacco Advertisements.

PubMed

Leshner, Glenn; Clayton, Russell B; Bolls, Paul D; Bhandari, Manu

2017-08-29

A 2 × 2 experiment was conducted, where participants watched anti-tobacco messages that varied in deception (content portraying tobacco companies as dishonest) and disgust (negative graphic images) content. Psychophysiological measures, self-report, and a recognition test were used to test hypotheses generated from the motivated cognition framework. The results of this study indicate that messages containing both deception and disgust push viewers into a cascade of defensive responses reflected by increased self-reported unpleasantness, reduced resources allocated to encoding, worsened recognition memory, and dampened emotional responses compared to messages depicting one attribute or neither. Findings from this study demonstrate the value of applying a motivated cognition theoretical framework in research on responses to emotional content in health messages and support previous research on defensive processing and message design of anti-tobacco messages.
Long-Term Memory and Learning

ERIC Educational Resources Information Center

Crossland, John

2011-01-01

The English National Curriculum Programmes of Study emphasise the importance of knowledge, understanding and skills, and teachers are well versed in structuring learning in those terms. Research outcomes into how long-term memory is stored and retrieved provide support for structuring learning in this way. Four further messages are added to the…
Earth Observation

NASA Image and Video Library

2013-07-26

Earth observation taken during day pass by an Expedition 36 crew member on board the International Space Station (ISS). Per Twitter message: Never tire of finding shapes in the clouds! These look very botanical to me. Simply perfect.

PREEMIE Reauthorization Act

THOMAS, 112th Congress

Sen. Alexander, Lamar [R-TN

2011-07-28

Senate - 12/19/2012 Message on House action received in Senate and at desk: House amendments to Senate bill. (All Actions) Tracker: This bill has the status Passed HouseHere are the steps for Status of Legislation:
Electronic Cigarettes on Twitter – Spreading the Appeal of Flavors

PubMed Central

Chu, Kar-Hai; Unger, Jennifer B.; Cruz, Tess Boley; Soto, Daniel W.

2016-01-01

Objectives Social media platforms are used by tobacco companies to promote products. This study examines message content on Twitter from e-cigarette brands and determines if messages about flavors are more likely than non-flavor messages to be passed along to other viewers. Methods We examined Twitter data from 2 e-cigarette brands and identified messages that contained terms related to e-cigarette flavors. Results Flavor-related posts were retweeted at a significantly higher rate by e-cigarette brands (p = .04) and other Twitter users (p < .01) than non-flavor posts. Conclusions E-cigarette brands and other Twitter users pay attention to flavor-related posts and retweet them often. These findings suggest flavors continue to be an attractive characteristic and their marketing should be monitored closely. PMID:27853734
Identifying an influential spreader from a single seed in complex networks via a message-passing approach

NASA Astrophysics Data System (ADS)

Min, Byungjoon

2018-01-01

Identifying the most influential spreaders is one of outstanding problems in physics of complex systems. So far, many approaches have attempted to rank the influence of nodes but there is still the lack of accuracy to single out influential spreaders. Here, we directly tackle the problem of finding important spreaders by solving analytically the expected size of epidemic outbreaks when spreading originates from a single seed. We derive and validate a theory for calculating the size of epidemic outbreaks with a single seed using a message-passing approach. In addition, we find that the probability to occur epidemic outbreaks is highly dependent on the location of the seed but the size of epidemic outbreaks once it occurs is insensitive to the seed. We also show that our approach can be successfully adapted into weighted networks.
The design of multi-core DSP parallel model based on message passing and multi-level pipeline

NASA Astrophysics Data System (ADS)

Niu, Jingyu; Hu, Jian; He, Wenjing; Meng, Fanrong; Li, Chuanrong

2017-10-01

Currently, the design of embedded signal processing system is often based on a specific application, but this idea is not conducive to the rapid development of signal processing technology. In this paper, a parallel processing model architecture based on multi-core DSP platform is designed, and it is mainly suitable for the complex algorithms which are composed of different modules. This model combines the ideas of multi-level pipeline parallelism and message passing, and summarizes the advantages of the mainstream model of multi-core DSP (the Master-Slave model and the Data Flow model), so that it has better performance. This paper uses three-dimensional image generation algorithm to validate the efficiency of the proposed model by comparing with the effectiveness of the Master-Slave and the Data Flow model.
Generalized hypercube structures and hyperswitch communication network

NASA Technical Reports Server (NTRS)

Young, Steven D.

1992-01-01

This paper discusses an ongoing study that uses a recent development in communication control technology to implement hybrid hypercube structures. These architectures are similar to binary hypercubes, but they also provide added connectivity between the processors. This added connectivity increases communication reliability while decreasing the latency of interprocessor message passing. Because these factors directly determine the speed that can be obtained by multiprocessor systems, these architectures are attractive for applications such as remote exploration and experimentation, where high performance and ultrareliability are required. This paper describes and enumerates these architectures and discusses how they can be implemented with a modified version of the hyperswitch communication network (HCN). The HCN is analyzed because it has three attractive features that enable these architectures to be effective: speed, fault tolerance, and the ability to pass multiple messages simultaneously through the same hyperswitch controller.
Quantum Secure Direct Communication with Quantum Memory

NASA Astrophysics Data System (ADS)

Zhang, Wei; Ding, Dong-Sheng; Sheng, Yu-Bo; Zhou, Lan; Shi, Bao-Sen; Guo, Guang-Can

2017-06-01

Quantum communication provides an absolute security advantage, and it has been widely developed over the past 30 years. As an important branch of quantum communication, quantum secure direct communication (QSDC) promotes high security and instantaneousness in communication through directly transmitting messages over a quantum channel. The full implementation of a quantum protocol always requires the ability to control the transfer of a message effectively in the time domain; thus, it is essential to combine QSDC with quantum memory to accomplish the communication task. In this Letter, we report the experimental demonstration of QSDC with state-of-the-art atomic quantum memory for the first time in principle. We use the polarization degrees of freedom of photons as the information carrier, and the fidelity of entanglement decoding is verified as approximately 90%. Our work completes a fundamental step toward practical QSDC and demonstrates a potential application for long-distance quantum communication in a quantum network.
Quantum Secure Direct Communication with Quantum Memory.

PubMed

Zhang, Wei; Ding, Dong-Sheng; Sheng, Yu-Bo; Zhou, Lan; Shi, Bao-Sen; Guo, Guang-Can

2017-06-02

Quantum communication provides an absolute security advantage, and it has been widely developed over the past 30 years. As an important branch of quantum communication, quantum secure direct communication (QSDC) promotes high security and instantaneousness in communication through directly transmitting messages over a quantum channel. The full implementation of a quantum protocol always requires the ability to control the transfer of a message effectively in the time domain; thus, it is essential to combine QSDC with quantum memory to accomplish the communication task. In this Letter, we report the experimental demonstration of QSDC with state-of-the-art atomic quantum memory for the first time in principle. We use the polarization degrees of freedom of photons as the information carrier, and the fidelity of entanglement decoding is verified as approximately 90%. Our work completes a fundamental step toward practical QSDC and demonstrates a potential application for long-distance quantum communication in a quantum network.
Design structure for in-system redundant array repair in integrated circuits

DOEpatents

Bright, Arthur A.; Crumley, Paul G.; Dombrowa, Marc; Douskey, Steven M.; Haring, Rudolf A.; Oakland, Steven F.; Quellette, Michael R.; Strissel, Scott A.

2008-11-25

A design structure for repairing an integrated circuit during operation of the integrated circuit. The integrated circuit comprising of a multitude of memory arrays and a fuse box holding control data for controlling redundancy logic of the arrays. The design structure provides the integrated circuit with a control data selector for passing the control data from the fuse box to the memory arrays; providing a source of alternate control data, external of the integrated circuit; and connecting the source of alternate control data to the control data selector. The design structure further passes the alternate control data from the source thereof, through the control data selector and to the memory arrays to control the redundancy logic of the memory arrays.
A Prize-Collecting Steiner Tree Approach for Transduction Network Inference

NASA Astrophysics Data System (ADS)

Bailly-Bechet, Marc; Braunstein, Alfredo; Zecchina, Riccardo

Into the cell, information from the environment is mainly propagated via signaling pathways which form a transduction network. Here we propose a new algorithm to infer transduction networks from heterogeneous data, using both the protein interaction network and expression datasets. We formulate the inference problem as an optimization task, and develop a message-passing, probabilistic and distributed formalism to solve it. We apply our algorithm to the pheromone response in the baker’s yeast S. cerevisiae. We are able to find the backbone of the known structure of the MAPK cascade of pheromone response, validating our algorithm. More importantly, we make biological predictions about some proteins whose role could be at the interface between pheromone response and other cellular functions.
Numerical computation of solar neutrino flux attenuated by the MSW mechanism

NASA Astrophysics Data System (ADS)

Kim, Jai Sam; Chae, Yoon Sang; Kim, Jung Dae

1999-07-01

We compute the survival probability of an electron neutrino in its flight through the solar core experiencing the Mikheyev-Smirnov-Wolfenstein effect with all three neutrino species considered. We adopted a hybrid method that uses an accurate approximation formula in the non-resonance region and numerical integration in the non-adiabatic resonance region. The key of our algorithm is to use the importance sampling method for sampling the neutrino creation energy and position and to find the optimum radii to start and stop numerical integration. We further developed a parallel algorithm for a message passing parallel computer. By using an idea of job token, we have developed a dynamical load balancing mechanism which is effective under any irregular load distributions
Practical Formal Verification of MPI and Thread Programs

NASA Astrophysics Data System (ADS)

Gopalakrishnan, Ganesh; Kirby, Robert M.

Large-scale simulation codes in science and engineering are written using the Message Passing Interface (MPI). Shared memory threads are widely used directly, or to implement higher level programming abstractions. Traditional debugging methods for MPI or thread programs are incapable of providing useful formal guarantees about coverage. They get bogged down in the sheer number of interleavings (schedules), often missing shallow bugs. In this tutorial we will introduce two practical formal verification tools: ISP (for MPI C programs) and Inspect (for Pthread C programs). Unlike other formal verification tools, ISP and Inspect run directly on user source codes (much like a debugger). They pursue only the relevant set of process interleavings, using our own customized Dynamic Partial Order Reduction algorithms. For a given test harness, DPOR allows these tools to guarantee the absence of deadlocks, instrumented MPI object leaks and communication races (using ISP), and shared memory races (using Inspect). ISP and Inspect have been used to verify large pieces of code: in excess of 10,000 lines of MPI/C for ISP in under 5 seconds, and about 5,000 lines of Pthread/C code in a few hours (and much faster with the use of a cluster or by exploiting special cases such as symmetry) for Inspect. We will also demonstrate the Microsoft Visual Studio and Eclipse Parallel Tools Platform integrations of ISP (these will be available on the LiveCD).
Computation of scattering matrix elements of large and complex shaped absorbing particles with multilevel fast multipole algorithm

NASA Astrophysics Data System (ADS)

Wu, Yueqian; Yang, Minglin; Sheng, Xinqing; Ren, Kuan Fang

2015-05-01

Light scattering properties of absorbing particles, such as the mineral dusts, attract a wide attention due to its importance in geophysical and environment researches. Due to the absorbing effect, light scattering properties of particles with absorption differ from those without absorption. Simple shaped absorbing particles such as spheres and spheroids have been well studied with different methods but little work on large complex shaped particles has been reported. In this paper, the surface Integral Equation (SIE) with Multilevel Fast Multipole Algorithm (MLFMA) is applied to study scattering properties of large non-spherical absorbing particles. SIEs are carefully discretized with piecewise linear basis functions on triangle patches to model whole surface of the particle, hence computation resource needs increase much more slowly with the particle size parameter than the volume discretized methods. To improve further its capability, MLFMA is well parallelized with Message Passing Interface (MPI) on distributed memory computer platform. Without loss of generality, we choose the computation of scattering matrix elements of absorbing dust particles as an example. The comparison of the scattering matrix elements computed by our method and the discrete dipole approximation method (DDA) for an ellipsoid dust particle shows that the precision of our method is very good. The scattering matrix elements of large ellipsoid dusts with different aspect ratios and size parameters are computed. To show the capability of the presented algorithm for complex shaped particles, scattering by asymmetry Chebyshev particle with size parameter larger than 600 of complex refractive index m = 1.555 + 0.004 i and different orientations are studied.
Advanced computational techniques for incompressible/compressible fluid-structure interactions

NASA Astrophysics Data System (ADS)

Kumar, Vinod

2005-07-01

Fluid-Structure Interaction (FSI) problems are of great importance to many fields of engineering and pose tremendous challenges to numerical analyst. This thesis addresses some of the hurdles faced for both 2D and 3D real life time-dependent FSI problems with particular emphasis on parachute systems. The techniques developed here would help improve the design of parachutes and are of direct relevance to several other FSI problems. The fluid system is solved using the Deforming-Spatial-Domain/Stabilized Space-Time (DSD/SST) finite element formulation for the Navier-Stokes equations of incompressible and compressible flows. The structural dynamics solver is based on a total Lagrangian finite element formulation. Newton-Raphson method is employed to linearize the otherwise nonlinear system resulting from the fluid and structure formulations. The fluid and structural systems are solved in decoupled fashion at each nonlinear iteration. While rigorous coupling methods are desirable for FSI simulations, the decoupled solution techniques provide sufficient convergence in the time-dependent problems considered here. In this thesis, common problems in the FSI simulations of parachutes are discussed and possible remedies for a few of them are presented. Further, the effects of the porosity model on the aerodynamic forces of round parachutes are analyzed. Techniques for solving compressible FSI problems are also discussed. Subsequently, a better stabilization technique is proposed to efficiently capture and accurately predict the shocks in supersonic flows. The numerical examples simulated here require high performance computing. Therefore, numerical tools using distributed memory supercomputers with message passing interface (MPI) libraries were developed.
Recall Performance of Children Failing Memory Portions of a Speech--Language--Memory Screening Battery.

ERIC Educational Resources Information Center

Tobey, Emily A.; And Others

1982-01-01

Recall performance of 22 first-grade and third-grade children who failed memory portions of a speech-language-memory screen was examined using digit and consonant-vowel (CV) stimulus sets. Data indicate children failing the screening battery differed quantitatively, rather than qualitatively, from children passing the screening batter. (Author)
Earth Observation

NASA Image and Video Library

2013-08-03

Earth observation taken during day pass by an Expedition 36 crew member on board the International Space Station (ISS). Per Twitter message: From southernmost point of orbit over the South Pacific- all clouds seemed to be leading to the South Pole.
Earth Observation

NASA Image and Video Library

2013-07-21

Earth observation taken during night pass by an Expedition 36 crew member on board the International Space Station (ISS). Per Twitter message this is labeled as : Tehran, Iran. Lights along the coast of the Caspian Sea visible through clouds. July 21.
Whistleblower Protection Enhancement Act of 2010

THOMAS, 111th Congress

Sen. Akaka, Daniel K. [D-HI

2009-02-03

Senate - 12/22/2010 Message on House action received in Senate and at desk: House amendments to Senate bill. (All Actions) Tracker: This bill has the status Passed HouseHere are the steps for Status of Legislation:
Earth Observation

NASA Image and Video Library

2014-05-31

Earth Observation taken during a day pass by the Expedition 40 crew aboard the International Space Station (ISS). Folder lists this as: CEO - Arena de Sao Paolo. View used for Twitter message: Cloudy skies over São Paulo Brazil
The Effect of NUMA Tunings on CPU Performance

NASA Astrophysics Data System (ADS)

Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr

2015-12-01

Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to other CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark, and ATLAS software.
Measuring Memory and Attention to Preview in Motion.

PubMed

Jagacinski, Richard J; Hammond, Gordon M; Rizzi, Emanuele

2017-08-01

Objective Use perceptual-motor responses to perturbations to reveal the spatio-temporal detail of memory for the recent past and attention to preview when participants track a winding roadway. Background Memory of the recently passed roadway can be inferred from feedback control models of the participants' manual movement patterns. Similarly, attention to preview of the upcoming roadway can be inferred from feedforward control models of manual movement patterns. Method Perturbation techniques were used to measure these memory and attention functions. Results In a laboratory tracking task, the bandwidth of lateral roadway deviations was found to primarily influence memory for the past roadway rather than attention to preview. A secondary auditory/verbal/vocal memory task resulted in higher velocity error and acceleration error in the tracking task but did not affect attention to preview. Attention to preview was affected by the frequency pattern of sinusoidal perturbations of the roadway. Conclusion Perturbation techniques permit measurement of the spatio-temporal span of memory and attention to preview that affect tracking a winding roadway. They also provide new ways to explore goal-directed forgetting and spatially distributed attention in the context of movement. More generally, these techniques provide sensitive measures of individual differences in cognitive aspects of action. Application Models of driving behavior and assessment of driving skill may benefit from more detailed spatio-temporal measurement of attention to preview.

Self-pacing direct memory access data transfer operations for compute nodes in a parallel computer

DOEpatents

Blocksome, Michael A

2015-02-17

Methods, apparatus, and products are disclosed for self-pacing DMA data transfer operations for nodes in a parallel computer that include: transferring, by an origin DMA on an origin node, a RTS message to a target node, the RTS message specifying an message on the origin node for transfer to the target node; receiving, in an origin injection FIFO for the origin DMA from a target DMA on the target node in response to transferring the RTS message, a target RGET descriptor followed by a DMA transfer operation descriptor, the DMA descriptor for transmitting a message portion to the target node, the target RGET descriptor specifying an origin RGET descriptor on the origin node that specifies an additional DMA descriptor for transmitting an additional message portion to the target node; processing, by the origin DMA, the target RGET descriptor; and processing, by the origin DMA, the DMA transfer operation descriptor.
A Connectionist/Control Architecture for Working Memory and Workload: Why Working Memory Is Not 7+ /-2

DTIC Science & Technology

1987-09-29

load on the centra and regional controllers. Strategy #S: Reduce message Chase. W.G., & Ericsson. K.A. ( 1082 ). Skill andInterference of concurrently...bypsycholocy Lf learning and motivation. Vol. Ia. strengthening region. to,.region connections on theNeYokAcdmcPs. innerlooip. Crowder, R.G. ( 1082
Effects of cell-phone and text-message distractions on true and false recognition.

PubMed

Smith, Theodore S; Isaak, Matthew I; Senette, Christian G; Abadie, Brenton G

2011-06-01

This study examined the effects of electronic communication distractions, including cell-phone and texting demands, on true and false recognition, specifically semantically related words presented and not presented on a computer screen. Participants were presented with 24 Deese-Roediger-McDermott (DRM) lists while manipulating the concurrent presence or absence of cell-phone and text-message distractions during study. In the DRM paradigm, participants study lists of semantically related words (e.g., mother, crib, and diaper) linked to a non-presented critical lure (e.g., baby). After studying the lists of words, participants are then requested to recall or recognize previously presented words. Participants often not only demonstrate high remembrance for presented words (true memory: crib), but also recollection for non-presented words (false memory: baby). In the present study, true memory was highest when participants were not presented with any distraction tasks during study of DRM words, but poorer when they were required to complete a cell-phone conversation or text-message task during study. False recognition measures did not statistically vary across distraction conditions. Signal detection analyses showed that participants better discriminated true targets (list items presented during study) from true target controls (items presented during study only) when cell-phone or text-message distractions were absent than when they were present. Response bias did not vary significantly across distraction conditions, as there were no differences in the likelihood that a participant would claim an item as "old" (previously presented) rather than "new" (not previously presented). Results of this study are examined with respect to both activation monitoring and fuzzy trace theories.
Multi-level scaling properties of instant-message communications

NASA Astrophysics Data System (ADS)

Chen, Guanxiong; Han, Xiaopu; Wang, Binghong

2010-08-01

To research the statistical properties of human's communication behaviors is one of the highlight areas of Human Dynamics. In this paper, we analyze the instant message data of QICQ from volunteers, and discover that there are many forms of non-Poisson characters, such as inter-event distributions of sending and receiving messages, communications between two friends, log-in activities, the distribution of online time, quantities of messages, and so on. These distributions not only denote the pattern of human communication activities, but also relate to the statistical property of human behaviors in using software. We find out that most of these exponents distribute between -1 and -2, which indicates that the Instant Message (IM) communication behavior of human is different from Non-IM communication behaviors; there are many fat-tail characters related to IM communication behavior.
Method and apparatus for in-system redundant array repair on integrated circuits

DOEpatents

Bright, Arthur A [Croton-on-Hudson, NY; Crumley, Paul G [Yorktown Heights, NY; Dombrowa, Marc B [Bronx, NY; Douskey, Steven M [Rochester, MN; Haring, Rudolf A [Cortlandt Manor, NY; Oakland, Steven F [Colchester, VT; Ouellette, Michael R [Westford, VT; Strissel, Scott A [Byron, MN

2008-07-29

Disclosed is a method of repairing an integrated circuit of the type comprising of a multitude of memory arrays and a fuse box holding control data for controlling redundancy logic of the arrays. The method comprises the steps of providing the integrated circuit with a control data selector for passing the control data from the fuse box to the memory arrays; providing a source of alternate control data, external of the integrated circuit; and connecting the source of alternate control data to the control data selector. The method comprises the further step of, at a given time, passing the alternate control data from the source thereof, through the control data selector and to the memory arrays to control the redundancy logic of the memory arrays.
Method and apparatus for in-system redundant array repair on integrated circuits

DOEpatents

Bright, Arthur A [Croton-on-Hudson, NY; Crumley, Paul G [Yorktown Heights, NY; Dombrowa, Marc B [Bronx, NY; Douskey, Steven M [Rochester, MN; Haring, Rudolf A [Cortlandt Manor, NY; Oakland, Steven F [Colchester, VT; Ouellette, Michael R [Westford, VT; Strissel, Scott A [Byron, MN

2008-07-08

Disclosed is a method of repairing an integrated circuit of the type comprising of a multitude of memory arrays and a fuse box holding control data for controlling redundancy logic of the arrays. The method comprises the steps of providing the integrated circuit with a control data selector for passing the control data from the fuse box to the memory arrays; providing a source of alternate control data, external of the integrated circuit; and connecting the source of alternate control data to the control data selector. The method comprises the further step of, at a given time, passing the alternate control data from the source thereof, through the control data selector and to the memory arrays to control the redundancy logic of the memory arrays.
Method and apparatus for in-system redundant array repair on integrated circuits

DOEpatents

Bright, Arthur A.; Crumley, Paul G.; Dombrowa, Marc B.; Douskey, Steven M.; Haring, Rudolf A.; Oakland, Steven F.; Ouellette, Michael R.; Strissel, Scott A.

2007-12-18

Disclosed is a method of repairing an integrated circuit of the type comprising of a multitude of memory arrays and a fuse box holding control data for controlling redundancy logic of the arrays. The method comprises the steps of providing the integrated circuit with a control data selector for passing the control data from the fuse box to the memory arrays; providing a source of alternate control data, external of the integrated circuit; and connecting the source of alternate control data to the control data selector. The method comprises the further step of, at a given time, passing the alternate control data from the source thereof, through the control data selector and to the memory arrays to control the redundancy logic of the memory arrays.
Shape memory polymer medical device

DOEpatents

Maitland, Duncan [Pleasant Hill, CA; Benett, William J [Livermore, CA; Bearinger, Jane P [Livermore, CA; Wilson, Thomas S [San Leandro, CA; Small, IV, Ward; Schumann, Daniel L [Concord, CA; Jensen, Wayne A [Livermore, CA; Ortega, Jason M [Pacifica, CA; Marion, III, John E.; Loge, Jeffrey M [Stockton, CA

2010-06-29

A system for removing matter from a conduit. The system includes the steps of passing a transport vehicle and a shape memory polymer material through the conduit, transmitting energy to the shape memory polymer material for moving the shape memory polymer material from a first shape to a second and different shape, and withdrawing the transport vehicle and the shape memory polymer material through the conduit carrying the matter.
Terrorism Risk Insurance Program Reauthorization Act of 2014

THOMAS, 113th Congress

Sen. Schumer, Charles E. [D-NY

2014-04-10

Senate - 12/11/2014 Message on House action received in Senate and at desk: House amendment to Senate bill. (All Actions) Tracker: This bill has the status Passed HouseHere are the steps for Status of Legislation:
Electronic Message Preservation Act

THOMAS, 111th Congress

Rep. Hodes, Paul W. [D-NH-2

2009-03-09

Senate - 03/18/2010 Received in the Senate and Read twice and referred to the Committee on Homeland Security and Governmental Affairs. (All Actions) Tracker: This bill has the status Passed HouseHere are the steps for Status of Legislation:
Federal Financial Assistance Management Improvement Act of 2009

THOMAS, 111th Congress

Sen. Voinovich, George V. [R-OH

2009-01-22

Senate - 12/15/2009 Message on House action received in Senate and at desk: House amendment to Senate bill. (All Actions) Tracker: This bill has the status Passed HouseHere are the steps for Status of Legislation:
Message-passing-interface-based parallel FDTD investigation on the EM scattering from a 1-D rough sea surface using uniaxial perfectly matched layer absorbing boundary.

PubMed

Li, J; Guo, L-X; Zeng, H; Han, X-B

2009-06-01

A message-passing-interface (MPI)-based parallel finite-difference time-domain (FDTD) algorithm for the electromagnetic scattering from a 1-D randomly rough sea surface is presented. The uniaxial perfectly matched layer (UPML) medium is adopted for truncation of FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different processors is illustrated for one sea surface realization, and the computation time of the parallel FDTD algorithm is dramatically reduced compared to a single-process implementation. Finally, some numerical results are shown, including the backscattering characteristics of sea surface for different polarization and the bistatic scattering from a sea surface with large incident angle and large wind speed.
Probability Distributions over Cryptographic Protocols

DTIC Science & Technology

2009-06-01

Artificial Immune Algorithm . . . . . . . . . . . . . . . . . . . 9 3 Design Decisions 11 3.1 Common Ground...creation algorithm for unbounded distribution . . . . . . . 24 4.2 Message creation algorithm for unbounded naive distribution . . . . 24 4.3 Protocol...creation algorithm for intended-run distributions . . . . . . 26 4.4 Protocol and message creation algorithm for realistic distribution . . 32 ix THIS
A Routing Path Construction Method for Key Dissemination Messages in Sensor Networks

PubMed Central

Moon, Soo Young; Cho, Tae Ho

2014-01-01

Authentication is an important security mechanism for detecting forged messages in a sensor network. Each cluster head (CH) in dynamic key distribution schemes forwards a key dissemination message that contains encrypted authentication keys within its cluster to next-hop nodes for the purpose of authentication. The forwarding path of the key dissemination message strongly affects the number of nodes to which the authentication keys in the message are actually distributed. We propose a routing method for the key dissemination messages to increase the number of nodes that obtain the authentication keys. In the proposed method, each node selects next-hop nodes to which the key dissemination message will be forwarded based on secret key indexes, the distance to the sink node, and the energy consumption of its neighbor nodes. The experimental results show that the proposed method can increase by 50–70% the number of nodes to which authentication keys in each cluster are distributed compared to geographic and energy-aware routing (GEAR). In addition, the proposed method can detect false reports earlier by using the distributed authentication keys, and it consumes less energy than GEAR when the false traffic ratio (FTR) is ≥10%. PMID:25136649
Reviewing the Role of Cognitive Load, Expertise Level, Motivation, and Unconscious Processing in Working Memory Performance

ERIC Educational Resources Information Center

Kuldas, Seffetullah; Hashim, Shahabuddin; Ismail, Hairul Nizam; Abu Bakar, Zainudin

2015-01-01

Human cognitive capacity is unavailable for conscious processing of every amount of instructional messages. Aligning an instructional design with learner expertise level would allow better use of available working memory capacity in a cognitive learning task. Motivating students to learn consciously is also an essential determinant of the capacity…
Articulatory Suppression in Language Interpretation: Working Memory Capacity, Dual Tasking and Word Knowledge

ERIC Educational Resources Information Center

Padilla, Francisca; Bajo, Maria Teresa; Macizo, Pedro

2005-01-01

How do interpreters manage to cope with the adverse effects of concurrent articulation while trying to comprehend the message in the source language? In Experiments 1-3, we explored three possible working memory (WM) functions that may underlie the ability to simultaneously comprehend and produce in the interpreters: WM storage capacity,…
Co-creators of memory, metaphors for resilience, and mechanisms for recovery: flora in living memorials to 9/11

Treesearch

Heather L. McMillen; Lindsay K. Campbell; Erika S. Svendsen

2017-01-01

Planting trees to mark the passing of events and people is a longstanding tradition around the world. To better understand the contemporary practices of memorialization through planting, we examine living memorials created in response to the events of September 11, 2001. As an extension of the U.S. Forest Service Living Memorials Project, we reviewed existing data (n...
Distributed Parallel Processing and Dynamic Load Balancing Techniques for Multidisciplinary High Speed Aircraft Design

NASA Technical Reports Server (NTRS)

Krasteva, Denitza T.

1998-01-01

Multidisciplinary design optimization (MDO) for large-scale engineering problems poses many challenges (e.g., the design of an efficient concurrent paradigm for global optimization based on disciplinary analyses, expensive computations over vast data sets, etc.) This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems. The specific problem considered here is configuration optimization of a high speed civil transport (HSCT), and the efficient parallelization of the embedded paradigm for reasonable design space identification. Two distributed dynamic load balancing techniques (random polling and global round robin with message combining) and two necessary termination detection schemes (global task count and token passing) were implemented and evaluated in terms of effectiveness and scalability to large problem sizes and a thousand processors. The effect of certain parameters on execution time was also inspected. Empirical results demonstrated stable performance and effectiveness for all schemes, and the parametric study showed that the selected algorithmic parameters have a negligible effect on performance.
The relation between working memory capacity and auditory lateralization in children with auditory processing disorders.

PubMed

Moossavi, Abdollah; Mehrkian, Saiedeh; Lotfi, Yones; Faghihzadeh, Soghrat; sajedi, Hamed

2014-11-01

Auditory processing disorder (APD) describes a complex and heterogeneous disorder characterized by poor speech perception, especially in noisy environments. APD may be responsible for a range of sensory processing deficits associated with learning difficulties. There is no general consensus about the nature of APD and how the disorder should be assessed or managed. This study assessed the effect of cognition abilities (working memory capacity) on sound lateralization in children with auditory processing disorders, in order to determine how "auditory cognition" interacts with APD. The participants in this cross-sectional comparative study were 20 typically developing and 17 children with a diagnosed auditory processing disorder (9-11 years old). Sound lateralization abilities investigated using inter-aural time (ITD) differences and inter-aural intensity (IID) differences with two stimuli (high pass and low pass noise) in nine perceived positions. Working memory capacity was evaluated using the non-word repetition, and forward and backward digits span tasks. Linear regression was employed to measure the degree of association between working memory capacity and localization tests between the two groups. Children in the APD group had consistently lower scores than typically developing subjects in lateralization and working memory capacity measures. The results showed working memory capacity had significantly negative correlation with ITD errors especially with high pass noise stimulus but not with IID errors in APD children. The study highlights the impact of working memory capacity on auditory lateralization. The finding of this research indicates that the extent to which working memory influences auditory processing depend on the type of auditory processing and the nature of stimulus/listening situation. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Global synchronization algorithms for the Intel iPSC/860

NASA Technical Reports Server (NTRS)

Seidel, Steven R.; Davis, Mark A.

1992-01-01

In a distributed memory multicomputer that has no global clock, global processor synchronization can only be achieved through software. Global synchronization algorithms are used in tridiagonal systems solvers, CFD codes, sequence comparison algorithms, and sorting algorithms. They are also useful for event simulation, debugging, and for solving mutual exclusion problems. For the Intel iPSC/860 in particular, global synchronization can be used to ensure the most effective use of the communication network for operations such as the shift, where each processor in a one-dimensional array or ring concurrently sends a message to its right (or left) neighbor. Three global synchronization algorithms are considered for the iPSC/860: the gysnc() primitive provided by Intel, the PICL primitive sync0(), and a new recursive doubling synchronization (RDS) algorithm. The performance of these algorithms is compared to the performance predicted by communication models of both the long and forced message protocols. Measurements of the cost of shift operations preceded by global synchronization show that the RDS algorithm always synchronizes the nodes more precisely and costs only slightly more than the other two algorithms.

Analysis of a parallelized nonlinear elliptic boundary value problem solver with application to reacting flows

NASA Technical Reports Server (NTRS)

Keyes, David E.; Smooke, Mitchell D.

1987-01-01

A parallelized finite difference code based on the Newton method for systems of nonlinear elliptic boundary value problems in two dimensions is analyzed in terms of computational complexity and parallel efficiency. An approximate cost function depending on 15 dimensionless parameters is derived for algorithms based on stripwise and boxwise decompositions of the domain and a one-to-one assignment of the strip or box subdomains to processors. The sensitivity of the cost functions to the parameters is explored in regions of parameter space corresponding to model small-order systems with inexpensive function evaluations and also a coupled system of nineteen equations with very expensive function evaluations. The algorithm was implemented on the Intel Hypercube, and some experimental results for the model problems with stripwise decompositions are presented and compared with the theory. In the context of computational combustion problems, multiprocessors of either message-passing or shared-memory type may be employed with stripwise decompositions to realize speedup of O(n), where n is mesh resolution in one direction, for reasonable n.
Automated Instrumentation, Monitoring and Visualization of PVM Programs Using AIMS

NASA Technical Reports Server (NTRS)

Mehra, Pankaj; VanVoorst, Brian; Yan, Jerry; Lum, Henry, Jr. (Technical Monitor)

1994-01-01

We present views and analysis of the execution of several PVM (Parallel Virtual Machine) codes for Computational Fluid Dynamics on a networks of Sparcstations, including: (1) NAS Parallel Benchmarks CG and MG; (2) a multi-partitioning algorithm for NAS Parallel Benchmark SP; and (3) an overset grid flowsolver. These views and analysis were obtained using our Automated Instrumentation and Monitoring System (AIMS) version 3.0, a toolkit for debugging the performance of PVM programs. We will describe the architecture, operation and application of AIMS. The AIMS toolkit contains: (1) Xinstrument, which can automatically instrument various computational and communication constructs in message-passing parallel programs; (2) Monitor, a library of runtime trace-collection routines; (3) VK (Visual Kernel), an execution-animation tool with source-code clickback; and (4) Tally, a tool for statistical analysis of execution profiles. Currently, Xinstrument can handle C and Fortran 77 programs using PVM 3.2.x; Monitor has been implemented and tested on Sun 4 systems running SunOS 4.1.2; and VK uses XIIR5 and Motif 1.2. Data and views obtained using AIMS clearly illustrate several characteristic features of executing parallel programs on networked workstations: (1) the impact of long message latencies; (2) the impact of multiprogramming overheads and associated load imbalance; (3) cache and virtual-memory effects; and (4) significant skews between workstation clocks. Interestingly, AIMS can compensate for constant skew (zero drift) by calibrating the skew between a parent and its spawned children. In addition, AIMS' skew-compensation algorithm can adjust timestamps in a way that eliminates physically impossible communications (e.g., messages going backwards in time). Our current efforts are directed toward creating new views to explain the observed performance of PVM programs. Some of the features planned for the near future include: (1) ConfigView, showing the physical topology of the virtual machine, inferred using specially formatted IP (Internet Protocol) packets: and (2) LoadView, synchronous animation of PVM-program execution and resource-utilization patterns.
Development of a distributed control system for TOTEM experiment using ASIO Boost C++ libraries

NASA Astrophysics Data System (ADS)

Cafagna, F.; Mercadante, A.; Minafra, N.; Quinto, M.; Radicioni, E.

2014-06-01

The main goals of the TOTEM Experiment at the LHC are the measurements of the elastic and total p-p cross sections and the studies of the diffractive dissociation processes. Those scientific objectives are achieved by using three tracking detectors symmetrically arranged around the interaction point called IP5. The control system is based on a C++ software that allows the user, by means of a graphical interface, direct access to hardware and handling of devices configuration. A first release of the software was designed as a monolithic block, with all functionalities being merged together. Such approach showed soon its limits, mainly poor reusability and maintainability of the source code, evident not only in phase of bug-fixing, but also when one wants to extend functionalities or apply some other modifications. This led to the decision of a radical redesign of the software, now based on the dialogue (message-passing) among separate building blocks. Thanks to the acquired extensibility, the software gained new features and now is a complete tool by which it is possible not only to configure different devices interfacing with a large subset of buses like I2C and VME, but also to do data acquisition both for calibration and physics runs. Furthermore, the software lets the user set up a series of operations to be executed sequentially to handle complex operations. To achieve maximum flexibility, the program units may be run either as a single process or as separate processes on different PCs which exchange messages over the network, thus allowing remote control of the system. Portability is ensured by the adoption of the ASIO (Asynchronous Input Output) library of Boost, a cross-platform suite of libraries which is candidate to become part of the C++ 11 standard. We present the state of the art of this project and outline the future perspectives. In particular, we describe the system architecture and the message-passing scheme. We also report on the results obtained in a first complete test of the software both as a single process and on two PCs.
Privacy Sensitive Surveillance for Assisted Living - A Smart Camera Approach

NASA Astrophysics Data System (ADS)

Fleck, Sven; Straßer, Wolfgang

An elderly woman wanders about aimlessly in a home for assisted living. Suddenly, she collapses on the floor of a lonesome hallway. Usually it can take over two hours until a night nurse passes this spot on her next inspection round. But in this case she is already on site after two minutes, ready to help. She has received an alert message on her beeper: "Inhabitant fallen in hallway 2b". The source: the SmartSurv distributed network of smart cameras for automated and privacy respecting video analysis.Welcome to the future of smart surveillance Although this scenario is not yet daily practice, it shall make clear how such systems will impact the safety of the elderly without the privacy intrusion of traditional video surveillance systems.
Single Sided Messaging v. 0.6.6

DOE Office of Scientific and Technical Information (OSTI.GOV)

Curry, Matthew Leon; Farmer, Matthew Shane; Hassani, Amin

Single-Sided Messaging (SSM) is a portable, multitransport networking library that enables applications to leverage potential one-sided capabilities of underlying network transports. It also provides desirable semantics that services for highperformance, massively parallel computers can leverage, such as an explicit cancel operation for pending transmissions, as well as enhanced matching semantics favoring large numbers of buffers attached to a single match entry. This release supports TCP/IP, shared memory, and Infiniband.
Accumulative Roll Bonding and Post-Deformation Annealing of Cu-Al-Mn Shape Memory Alloy

NASA Astrophysics Data System (ADS)

Moghaddam, Ahmad Ostovari; Ketabchi, Mostafa; Afrasiabi, Yaser

2014-12-01

Accumulative roll bonding is a severe plastic deformation process used for Cu-Al-Mn shape memory alloy. The main purpose of this study is to investigate the possibility of grain refinement of Cu-9.5Al-8.2Mn (in wt.%) shape memory alloy using accumulative roll bonding and post-deformation annealing. The alloy was successfully subjected to 5 passes of accumulative roll bonding at 600 °C. The microstructure, properties as well as post-deformation annealing of this alloy were investigated by optical microscopy, scanning electron microscopy, x-ray diffraction, differential scanning calorimeter, and bend and tensile testing. The results showed that after 5 passes of ARB at 600 °C, specimens possessed α + β microstructure with the refined grains, but martensite phases and consequently shape memory effect completely disappeared. Post-deformation annealing was carried out at 700 °C, and the martensite phase with the smallest grain size (less than 40 μm) was obtained after 150 s of annealing at 700 °C. It was found that after 5 passes of ARB and post-deformation annealing, the stability of SME during thermal cycling improved. Also, tensile properties of alloys significantly improved after post-deformation annealing.
Message passing with queues and channels

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dozsa, Gabor J; Heidelberger, Philip; Kumar, Sameer

In an embodiment, a send thread receives an identifier that identifies a destination node and a pointer to data. The send thread creates a first send request in response to the receipt of the identifier and the data pointer. The send thread selects a selected channel from among a plurality of channels. The selected channel comprises a selected hand-off queue and an identification of a selected message unit. Each of the channels identifies a different message unit. The selected hand-off queue is randomly accessible. If the selected hand-off queue contains an available entry, the send thread adds the first sendmore » request to the selected hand-off queue. If the selected hand-off queue does not contain an available entry, the send thread removes a second send request from the selected hand-off queue and sends the second send request to the selected message unit.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Hansen, Timothy M.; Palmintier, Bryan; Suryanarayanan, Siddharth

As more Smart Grid technologies (e.g., distributed photovoltaic, spatially distributed electric vehicle charging) are integrated into distribution grids, static distribution simulations are no longer sufficient for performing modeling and analysis. GridLAB-D is an agent-based distribution system simulation environment that allows fine-grained end-user models, including geospatial and network topology detail. A problem exists in that, without outside intervention, once the GridLAB-D simulation begins execution, it will run to completion without allowing the real-time interaction of Smart Grid controls, such as home energy management systems and aggregator control. We address this lack of runtime interaction by designing a flexible communication interface, Bus.pymore » (pronounced bus-dot-pie), that uses Python to pass messages between one or more GridLAB-D instances and a Smart Grid simulator. This work describes the design and implementation of Bus.py, discusses its usefulness in terms of some Smart Grid scenarios, and provides an example of an aggregator-based residential demand response system interacting with GridLAB-D through Bus.py. The small scale example demonstrates the validity of the interface and shows that an aggregator using said interface is able to control residential loads in GridLAB-D during runtime to cause a reduction in the peak load on the distribution system in (a) peak reduction and (b) time-of-use pricing cases.« less
The BaBar Data Reconstruction Control System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ceseracciu, A

2005-04-20

The BaBar experiment is characterized by extremely high luminosity and very large volume of data produced and stored, with increasing computing requirements each year. To fulfill these requirements a Control System has been designed and developed for the offline distributed data reconstruction system. The control system described in this paper provides the performance and flexibility needed to manage a large number of small computing farms, and takes full benefit of OO design. The infrastructure is well isolated from the processing layer, it is generic and flexible, based on a light framework providing message passing and cooperative multitasking. The system ismore » distributed in a hierarchical way: the top-level system is organized in farms, farms in services, and services in subservices or code modules. It provides a powerful Finite State Machine framework to describe custom processing models in a simple regular language. This paper describes the design and evolution of this control system, currently in use at SLAC and Padova on {approx}450 CPUs organized in 9 farms.« less
The BaBar Data Reconstruction Control System

NASA Astrophysics Data System (ADS)

Ceseracciu, A.; Piemontese, M.; Tehrani, F. S.; Pulliam, T. M.; Galeazzi, F.

2005-08-01

The BaBar experiment is characterized by extremely high luminosity and very large volume of data produced and stored, with increasing computing requirements each year. To fulfill these requirements a control system has been designed and developed for the offline distributed data reconstruction system. The control system described in this paper provides the performance and flexibility needed to manage a large number of small computing farms, and takes full benefit of object oriented (OO) design. The infrastructure is well isolated from the processing layer, it is generic and flexible, based on a light framework providing message passing and cooperative multitasking. The system is distributed in a hierarchical way: the top-level system is organized in farms, farms in services, and services in subservices or code modules. It provides a powerful finite state machine framework to describe custom processing models in a simple regular language. This paper describes the design and evolution of this control system, currently in use at SLAC and Padova on /spl sim/450 CPUs organized in nine farms.
Scalable Optical-Fiber Communication Networks

NASA Technical Reports Server (NTRS)

Chow, Edward T.; Peterson, John C.

1993-01-01

Scalable arbitrary fiber extension network (SAFEnet) is conceptual fiber-optic communication network passing digital signals among variety of computers and input/output devices at rates from 200 Mb/s to more than 100 Gb/s. Intended for use with very-high-speed computers and other data-processing and communication systems in which message-passing delays must be kept short. Inherent flexibility makes it possible to match performance of network to computers by optimizing configuration of interconnections. In addition, interconnections made redundant to provide tolerance to faults.
Domestic Minor Sex Trafficking Deterrence and Victims Support Act of 2010

THOMAS, 111th Congress

Sen. Wyden, Ron [D-OR

2009-12-22

Senate - 12/21/2010 Message on House action received in Senate and at desk: House amendment to Senate bill. (All Actions) Tracker: This bill has the status Passed HouseHere are the steps for Status of Legislation:
Xyce Parallel Electronic Simulator Users' Guide Version 6.7.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one tomore » develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright c 2002-2017 Sandia Corporation. All rights reserved. Trademarks Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Orcad, Orcad Capture, PSpice and Probe are registered trademarks of Cadence Design Systems, Inc. Microsoft, Windows and Windows 7 are registered trademarks of Microsoft Corporation. Medici, DaVinci and Taurus are registered trademarks of Synopsys Corporation. Amtec and TecPlot are trademarks of Amtec Engineering, Inc. All other trademarks are property of their respective owners. Contacts World Wide Web http://xyce.sandia.gov https://info.sandia.gov/xyce (Sandia only) Email xyce@sandia.gov (outside Sandia) xyce-sandia@sandia.gov (Sandia only) Bug Reports (Sandia only) http://joseki-vm.sandia.gov/bugzilla http://morannon.sandia.gov/bugzilla« less
Chaos synchronization communication using extremely unsymmetrical bidirectional injections.

PubMed

Zhang, Wei Li; Pan, Wei; Luo, Bin; Zou, Xi Hua; Wang, Meng Yao; Zhou, Zhi

2008-02-01

Chaos synchronization and message transmission between two semiconductor lasers with extremely unsymmetrical bidirectional injections (EUBIs) are discussed. By using EUBIs, synchronization is realized through injection locking. Numerical results show that if the laser subjected to strong injection serves as the receiver, chaos pass filtering (CPF) of the system is similar to that of unidirectional coupled systems. Moreover, if the other laser serves as the receiver, a stronger CPF can be obtained. Finally, we demonstrate that messages can be extracted successfully from either of the two transmission directions of the system.
Automation Framework for Flight Dynamics Products Generation

NASA Technical Reports Server (NTRS)

Wiegand, Robert E.; Esposito, Timothy C.; Watson, John S.; Jun, Linda; Shoan, Wendy; Matusow, Carla

2010-01-01

XFDS provides an easily adaptable automation platform. To date it has been used to support flight dynamics operations. It coordinates the execution of other applications such as Satellite TookKit, FreeFlyer, MATLAB, and Perl code. It provides a mechanism for passing messages among a collection of XFDS processes, and allows sending and receiving of GMSEC messages. A unified and consistent graphical user interface (GUI) is used for the various tools. Its automation configuration is stored in text files, and can be edited either directly or using the GUI.
The effects of frame, appeal, and outcome extremity of antismoking messages on cognitive processing.

PubMed

Leshner, Glenn; Cheng, I-Huei

2009-04-01

Research on the impact of antismoking advertisements in countermarketing cigarette advertising is equivocal. Although many studies examined how different message appeal types influence people's attitudes and behavior, there have been few studies that have explored the mechanism of how individuals attend to and remember antismoking information. This study examined how message attributes of antismoking TV ads (frame, appeal type, and outcome extremity) interacted to influence people's attention (secondary task reaction time) and memory (recognition). Antismoking public service announcements were chosen that were either loss- or gain-framed, had either a health or social appeal, or had either a more or less extreme outcome described in the message. Among the key findings were that loss-framed messages with more extreme outcomes required the most processing resources (i.e., had the slowest secondary task reaction times) and were the best remembered (i.e., were best recognized). These findings indicate ways that different message attributes affect individuals' cognitive processing, and they are discussed in light of prior framing and persuasion research.
Replenishing data descriptors in a DMA injection FIFO buffer

DOEpatents

Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Cernohous, Bob R [Rochester, MN; Heidelberger, Philip [Cortlandt Manor, NY; Kumar, Sameer [White Plains, NY; Parker, Jeffrey J [Rochester, MN

2011-10-11

Methods, apparatus, and products are disclosed for replenishing data descriptors in a Direct Memory Access (`DMA`) injection first-in-first-out (`FIFO`) buffer that include: determining, by a messaging module on an origin compute node, whether a number of data descriptors in a DMA injection FIFO buffer exceeds a predetermined threshold, each data descriptor specifying an application message for transmission to a target compute node; queuing, by the messaging module, a plurality of new data descriptors in a pending descriptor queue if the number of the data descriptors in the DMA injection FIFO buffer exceeds the predetermined threshold; establishing, by the messaging module, interrupt criteria that specify when to replenish the injection FIFO buffer with the plurality of new data descriptors in the pending descriptor queue; and injecting, by the messaging module, the plurality of new data descriptors into the injection FIFO buffer in dependence upon the interrupt criteria.
"For whom the bell tolls": emotional rubbernecking in Facebook memorial groups.

PubMed

DeGroot, Jocelyn M

2014-01-01

Facebook memorial groups are often formed as a way for people to remember a deceased loved one. Because of the public nature of communication on Facebook, people who did not intimately know the deceased (emotional rubberneckers) can locate memorial groups and watch as people grieve the loss of their friend or family member. Using grounded theory methods, the author identified and examined the function of the rubberneckers' messages posted on 10 Facebook memorial group walls. Emotional rubberneckers identified with the deceased and expressed sadness at their death, indicating a connection with the deceased stranger.
Generation-based memory synchronization in a multiprocessor system with weakly consistent memory accesses

DOEpatents

Ohmacht, Martin

2017-08-15

In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Generation-based memory synchronization in a multiprocessor system with weakly consistent memory accesses

DOEpatents

Ohmacht, Martin

2014-09-09

In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.

Disaster Relief Appropriations Act, 2013

THOMAS, 112th Congress

Rep. Rogers, Harold [R-KY-5

2011-02-11

Senate - 12/30/2012 Message on Senate action sent to the House. (All Actions) Notes: This bill was for a time the Senate legislative vehicle for Hurricane Sandy supplemental appropriations. Tracker: This bill has the status Passed SenateHere are the steps for Status of Legislation:
Asynchronous Communication Scheme For Hypercube Computer

NASA Technical Reports Server (NTRS)

Madan, Herb S.

1988-01-01

Scheme devised for asynchronous-message communication system for Mark III hypercube concurrent-processor network. Network consists of up to 1,024 processing elements connected electrically as though were at corners of 10-dimensional cube. Each node contains two Motorola 68020 processors along with Motorola 68881 floating-point processor utilizing up to 4 megabytes of shared dynamic random-access memory. Scheme intended to support applications requiring passage of both polled or solicited and unsolicited messages.
Decentralized Formation Flying Control in a Multiple-Team Hierarchy

NASA Technical Reports Server (NTRS)

Mueller, Joseph .; Thomas, Stephanie J.

2005-01-01

This paper presents the prototype of a system that addresses these objectives-a decentralized guidance and control system that is distributed across spacecraft using a multiple-team framework. The objective is to divide large clusters into teams of manageable size, so that the communication and computational demands driven by N decentralized units are related to the number of satellites in a team rather than the entire cluster. The system is designed to provide a high-level of autonomy, to support clusters with large numbers of satellites, to enable the number of spacecraft in the cluster to change post-launch, and to provide for on-orbit software modification. The distributed guidance and control system will be implemented in an object-oriented style using MANTA (Messaging Architecture for Networking and Threaded Applications). In this architecture, tasks may be remotely added, removed or replaced post-launch to increase mission flexibility and robustness. This built-in adaptability will allow software modifications to be made on-orbit in a robust manner. The prototype system, which is implemented in MATLAB, emulates the object-oriented and message-passing features of the MANTA software. In this paper, the multiple-team organization of the cluster is described, and the modular software architecture is presented. The relative dynamics in eccentric reference orbits is reviewed, and families of periodic, relative trajectories are identified, expressed as sets of static geometric parameters. The guidance law design is presented, and an example reconfiguration scenario is used to illustrate the distributed process of assigning geometric goals to the cluster. Next, a decentralized maneuver planning approach is presented that utilizes linear-programming methods to enact reconfiguration and coarse formation keeping maneuvers. Finally, a method for performing online collision avoidance is discussed, and an example is provided to gauge its performance.
Interactive Supercomputing’s Star-P Platform

DOE Office of Scientific and Technical Information (OSTI.GOV)

Edelman, Alan; Husbands, Parry; Leibman, Steve

2006-09-19

The thesis of this extended abstract is simple. High productivity comes from high level infrastructures. To measure this, we introduce a methodology that goes beyond the tradition of timing software in serial and tuned parallel modes. We perform a classroom productivity study involving 29 students who have written a homework exercise in a low level language (MPI message passing) and a high level language (Star-P with MATLAB client). Our conclusions indicate what perhaps should be of little surprise: (1) the high level language is always far easier on the students than the low level language. (2) The early versions ofmore » the high level language perform inadequately compared to the tuned low level language, but later versions substantially catch up. Asymptotically, the analogy must hold that message passing is to high level language parallel programming as assembler is to high level environments such as MATLAB, Mathematica, Maple, or even Python. We follow the Kepner method that correctly realizes that traditional speedup numbers without some discussion of the human cost of reaching these numbers can fail to reflect the true human productivity cost of high performance computing. Traditional data compares low level message passing with serial computation. With the benefit of a high level language system in place, in our case Star-P running with MATLAB client, and with the benefit of a large data pool: 29 students, each running the same code ten times on three evolutions of the same platform, we can methodically demonstrate the productivity gains. To date we are not aware of any high level system as extensive and interoperable as Star-P, nor are we aware of an experiment of this kind performed with this volume of data.« less
Applications Performance Under MPL and MPI on NAS IBM SP2

NASA Technical Reports Server (NTRS)

Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)

1994-01-01

On July 5, 1994, an IBM Scalable POWER parallel System (IBM SP2) with 64 nodes, was installed at the Numerical Aerodynamic Simulation (NAS) Facility Each node of NAS IBM SP2 is a "wide node" consisting of a RISC 6000/590 workstation module with a clock of 66.5 MHz which can perform four floating point operations per clock with a peak performance of 266 Mflop/s. By the end of 1994, 64 nodes of IBM SP2 will be upgraded to 160 nodes with a peak performance of 42.5 Gflop/s. An overview of the IBM SP2 hardware is presented. The basic understanding of architectural details of RS 6000/590 will help application scientists the porting, optimizing, and tuning of codes from other machines such as the CRAY C90 and the Paragon to the NAS SP2. Optimization techniques such as quad-word loading, effective utilization of two floating point units, and data cache optimization of RS 6000/590 is illustrated, with examples giving performance gains at each optimization step. The conversion of codes using Intel's message passing library NX to codes using native Message Passing Library (MPL) and the Message Passing Interface (NMI) library available on the IBM SP2 is illustrated. In particular, we will present the performance of Fast Fourier Transform (FFT) kernel from NAS Parallel Benchmarks (NPB) under MPL and MPI. We have also optimized some of Fortran BLAS 2 and BLAS 3 routines, e.g., the optimized Fortran DAXPY runs at 175 Mflop/s and optimized Fortran DGEMM runs at 230 Mflop/s per node. The performance of the NPB (Class B) on the IBM SP2 is compared with the CRAY C90, Intel Paragon, TMC CM-5E, and the CRAY T3D.
On a model of three-dimensional bursting and its parallel implementation

NASA Astrophysics Data System (ADS)

Tabik, S.; Romero, L. F.; Garzón, E. M.; Ramos, J. I.

2008-04-01

A mathematical model for the simulation of three-dimensional bursting phenomena and its parallel implementation are presented. The model consists of four nonlinearly coupled partial differential equations that include fast and slow variables, and exhibits bursting in the absence of diffusion. The differential equations have been discretized by means of a second-order accurate in both space and time, linearly-implicit finite difference method in equally-spaced grids. The resulting system of linear algebraic equations at each time level has been solved by means of the Preconditioned Conjugate Gradient (PCG) method. Three different parallel implementations of the proposed mathematical model have been developed; two of these implementations, i.e., the MPI and the PETSc codes, are based on a message passing paradigm, while the third one, i.e., the OpenMP code, is based on a shared space address paradigm. These three implementations are evaluated on two current high performance parallel architectures, i.e., a dual-processor cluster and a Shared Distributed Memory (SDM) system. A novel representation of the results that emphasizes the most relevant factors that affect the performance of the paralled implementations, is proposed. The comparative analysis of the computational results shows that the MPI and the OpenMP implementations are about twice more efficient than the PETSc code on the SDM system. It is also shown that, for the conditions reported here, the nonlinear dynamics of the three-dimensional bursting phenomena exhibits three stages characterized by asynchronous, synchronous and then asynchronous oscillations, before a quiescent state is reached. It is also shown that the fast system reaches steady state in much less time than the slow variables.
3D streamers simulation in a pin to plane configuration using massively parallel computing

NASA Astrophysics Data System (ADS)

Plewa, J.-M.; Eichwald, O.; Ducasse, O.; Dessante, P.; Jacobs, C.; Renon, N.; Yousfi, M.

2018-03-01

This paper concerns the 3D simulation of corona discharge using high performance computing (HPC) managed with the message passing interface (MPI) library. In the field of finite volume methods applied on non-adaptive mesh grids and in the case of a specific 3D dynamic benchmark test devoted to streamer studies, the great efficiency of the iterative R&B SOR and BiCGSTAB methods versus the direct MUMPS method was clearly demonstrated in solving the Poisson equation using HPC resources. The optimization of the parallelization and the resulting scalability was undertaken as a function of the HPC architecture for a number of mesh cells ranging from 8 to 512 million and a number of cores ranging from 20 to 1600. The R&B SOR method remains at least about four times faster than the BiCGSTAB method and requires significantly less memory for all tested situations. The R&B SOR method was then implemented in a 3D MPI parallelized code that solves the classical first order model of an atmospheric pressure corona discharge in air. The 3D code capabilities were tested by following the development of one, two and four coplanar streamers generated by initial plasma spots for 6 ns. The preliminary results obtained allowed us to follow in detail the formation of the tree structure of a corona discharge and the effects of the mutual interactions between the streamers in terms of streamer velocity, trajectory and diameter. The computing time for 64 million of mesh cells distributed over 1000 cores using the MPI procedures is about 30 min ns-1, regardless of the number of streamers.
Implementation and Characterization of Three-Dimensional Particle-in-Cell Codes on Multiple-Instruction-Multiple-Data Massively Parallel Supercomputers

NASA Technical Reports Server (NTRS)

Lyster, P. M.; Liewer, P. C.; Decyk, V. K.; Ferraro, R. D.

1995-01-01

A three-dimensional electrostatic particle-in-cell (PIC) plasma simulation code has been developed on coarse-grain distributed-memory massively parallel computers with message passing communications. Our implementation is the generalization to three-dimensions of the general concurrent particle-in-cell (GCPIC) algorithm. In the GCPIC algorithm, the particle computation is divided among the processors using a domain decomposition of the simulation domain. In a three-dimensional simulation, the domain can be partitioned into one-, two-, or three-dimensional subdomains ("slabs," "rods," or "cubes") and we investigate the efficiency of the parallel implementation of the push for all three choices. The present implementation runs on the Intel Touchstone Delta machine at Caltech; a multiple-instruction-multiple-data (MIMD) parallel computer with 512 nodes. We find that the parallel efficiency of the push is very high, with the ratio of communication to computation time in the range 0.3%-10.0%. The highest efficiency (> 99%) occurs for a large, scaled problem with 64(sup 3) particles per processing node (approximately 134 million particles of 512 nodes) which has a push time of about 250 ns per particle per time step. We have also developed expressions for the timing of the code which are a function of both code parameters (number of grid points, particles, etc.) and machine-dependent parameters (effective FLOP rate, and the effective interprocessor bandwidths for the communication of particles and grid points). These expressions can be used to estimate the performance of scaled problems--including those with inhomogeneous plasmas--to other parallel machines once the machine-dependent parameters are known.
Computational study of scattering of a zero-order Bessel beam by large nonspherical homogeneous particles with the multilevel fast multipole algorithm

NASA Astrophysics Data System (ADS)

Yang, Minglin; Wu, Yueqian; Sheng, Xinqing; Ren, Kuan Fang

2017-12-01

Computation of scattering of shaped beams by large nonspherical particles is a challenge in both optics and electromagnetics domains since it concerns many research fields. In this paper, we report our new progress in the numerical computation of the scattering diagrams. Our algorithm permits to calculate the scattering of a particle of size as large as 110 wavelengths or 700 in size parameter. The particle can be transparent or absorbing of arbitrary shape, smooth or with a sharp surface, such as the Chebyshev particles or ice crystals. To illustrate the capacity of the algorithm, a zero order Bessel beam is taken as the incident beam, and the scattering of ellipsoidal particles and Chebyshev particles are taken as examples. Some special phenomena have been revealed and examined. The scattering problem is formulated with the combined tangential formulation and solved iteratively with the aid of the multilevel fast multipole algorithm, which is well parallelized with the message passing interface on the distributed memory computer platform using the hybrid partitioning strategy. The numerical predictions are compared with the results of the rigorous method for a spherical particle to validate the accuracy of the approach. The scattering diagrams of large ellipsoidal particles with various parameters are examined. The effect of aspect ratios, as well as half-cone angle of the incident zero-order Bessel beam and the off-axis distance on scattered intensity, is studied. Scattering by asymmetry Chebyshev particle with size parameter larger than 700 is also given to show the capability of the method for computing scattering by arbitrary shaped particles.
Interval timing in genetically modified mice: a simple paradigm

PubMed Central

Balci, F.; Papachristos, E. B.; Gallistel, C. R.; Brunner, D.; Gibson, J.; Shumyatsky, G. P.

2009-01-01

We describe a behavioral screen for the quantitative study of interval timing and interval memory in mice. Mice learn to switch from a short-latency feeding station to a long-latency station when the short latency has passed without a feeding. The psychometric function is the cumulative distribution of switch latencies. Its median measures timing accuracy and its interquartile interval measures timing precision. Next, using this behavioral paradigm, we have examined mice with a gene knockout of the receptor for gastrin-releasing peptide that show enhanced (i.e. prolonged) freezing in fear conditioning. We have tested the hypothesis that the mutants freeze longer because they are more uncertain than wild types about when to expect the electric shock. The knockouts however show normal accuracy and precision in timing, so we have rejected this alternative hypothesis. Last, we conduct the pharmacological validation of our behavioral screen using D-amphetamine and methamphetamine. We suggest including the analysis of interval timing and temporal memory in tests of genetically modified mice for learning and memory and argue that our paradigm allows this to be done simply and efficiently. PMID:17696995
Interval timing in genetically modified mice: a simple paradigm.

PubMed

Balci, F; Papachristos, E B; Gallistel, C R; Brunner, D; Gibson, J; Shumyatsky, G P

2008-04-01

We describe a behavioral screen for the quantitative study of interval timing and interval memory in mice. Mice learn to switch from a short-latency feeding station to a long-latency station when the short latency has passed without a feeding. The psychometric function is the cumulative distribution of switch latencies. Its median measures timing accuracy and its interquartile interval measures timing precision. Next, using this behavioral paradigm, we have examined mice with a gene knockout of the receptor for gastrin-releasing peptide that show enhanced (i.e. prolonged) freezing in fear conditioning. We have tested the hypothesis that the mutants freeze longer because they are more uncertain than wild types about when to expect the electric shock. The knockouts however show normal accuracy and precision in timing, so we have rejected this alternative hypothesis. Last, we conduct the pharmacological validation of our behavioral screen using d-amphetamine and methamphetamine. We suggest including the analysis of interval timing and temporal memory in tests of genetically modified mice for learning and memory and argue that our paradigm allows this to be done simply and efficiently.
Team performance in networked supervisory control of unmanned air vehicles: effects of automation, working memory, and communication content.

PubMed

McKendrick, Ryan; Shaw, Tyler; de Visser, Ewart; Saqer, Haneen; Kidwell, Brian; Parasuraman, Raja

2014-05-01

Assess team performance within a net-worked supervisory control setting while manipulating automated decision aids and monitoring team communication and working memory ability. Networked systems such as multi-unmanned air vehicle (UAV) supervision have complex properties that make prediction of human-system performance difficult. Automated decision aid can provide valuable information to operators, individual abilities can limit or facilitate team performance, and team communication patterns can alter how effectively individuals work together. We hypothesized that reliable automation, higher working memory capacity, and increased communication rates of task-relevant information would offset performance decrements attributed to high task load. Two-person teams performed a simulated air defense task with two levels of task load and three levels of automated aid reliability. Teams communicated and received decision aid messages via chat window text messages. Task Load x Automation effects were significant across all performance measures. Reliable automation limited the decline in team performance with increasing task load. Average team spatial working memory was a stronger predictor than other measures of team working memory. Frequency of team rapport and enemy location communications positively related to team performance, and word count was negatively related to team performance. Reliable decision aiding mitigated team performance decline during increased task load during multi-UAV supervisory control. Team spatial working memory, communication of spatial information, and team rapport predicted team success. An automated decision aid can improve team performance under high task load. Assessment of spatial working memory and the communication of task-relevant information can help in operator and team selection in supervisory control systems.
Autoplan: A self-processing network model for an extended blocks world planning environment

NASA Technical Reports Server (NTRS)

Dautrechy, C. Lynne; Reggia, James A.; Mcfadden, Frank

1990-01-01

Self-processing network models (neural/connectionist models, marker passing/message passing networks, etc.) are currently undergoing intense investigation for a variety of information processing applications. These models are potentially very powerful in that they support a large amount of explicit parallel processing, and they cleanly integrate high level and low level information processing. However they are currently limited by a lack of understanding of how to apply them effectively in many application areas. The formulation of self-processing network methods for dynamic, reactive planning is studied. The long-term goal is to formulate robust, computationally effective information processing methods for the distributed control of semiautonomous exploration systems, e.g., the Mars Rover. The current research effort is focusing on hierarchical plan generation, execution and revision through local operations in an extended blocks world environment. This scenario involves many challenging features that would be encountered in a real planning and control environment: multiple simultaneous goals, parallel as well as sequential action execution, action sequencing determined not only by goals and their interactions but also by limited resources (e.g., three tasks, two acting agents), need to interpret unanticipated events and react appropriately through replanning, etc.
Mission Services Evolution Center Message Bus

NASA Technical Reports Server (NTRS)

Mayorga, Arturo; Bristow, John O.; Butschky, Mike

2011-01-01

The Goddard Mission Services Evolution Center (GMSEC) Message Bus is a robust, lightweight, fault-tolerant middleware implementation that supports all messaging capabilities of the GMSEC API. This architecture is a distributed software system that routes messages based on message subject names and knowledge of the locations in the network of the interested software components.
Enabling Energy Saving Innovations Act

THOMAS, 112th Congress

Rep. Aderholt, Robert B. [R-AL-4

2012-04-26

Senate - 09/24/2012 Message on Senate action sent to the House. (All Actions) Notes: For further action, see H.R.6582, which became Public Law 112-210 on 12/18/2012. Tracker: This bill has the status Passed SenateHere are the steps for Status of Legislation:
A Model for Speedup of Parallel Programs

DTIC Science & Technology

1997-01-01

Sanjeev. K Setia . The interaction between mem- ory allocation and adaptive partitioning in message- passing multicomputers. In IPPS 󈨣 Workshop on Job...Scheduling Strategies for Parallel Processing, pages 89{99, 1995. [15] Sanjeev K. Setia and Satish K. Tripathi. A compar- ative analysis of static
Interstate Recognition of Notarizations Act of 2010

THOMAS, 111th Congress

Rep. Aderholt, Robert B. [R-AL-4

2009-10-14

House - 11/17/2010 On motion to refer the bill and the accompanying veto message to the Committee on the Judiciary. Agreed to without objection. (All Actions) Tracker: This bill has the status Failed to pass over vetoHere are the steps for Status of Legislation:
Approximate message passing for nonconvex sparse regularization with stability and asymptotic analysis

NASA Astrophysics Data System (ADS)

Sakata, Ayaka; Xu, Yingying

2018-03-01

We analyse a linear regression problem with nonconvex regularization called smoothly clipped absolute deviation (SCAD) under an overcomplete Gaussian basis for Gaussian random data. We propose an approximate message passing (AMP) algorithm considering nonconvex regularization, namely SCAD-AMP, and analytically show that the stability condition corresponds to the de Almeida-Thouless condition in spin glass literature. Through asymptotic analysis, we show the correspondence between the density evolution of SCAD-AMP and the replica symmetric (RS) solution. Numerical experiments confirm that for a sufficiently large system size, SCAD-AMP achieves the optimal performance predicted by the replica method. Through replica analysis, a phase transition between replica symmetric and replica symmetry breaking (RSB) region is found in the parameter space of SCAD. The appearance of the RS region for a nonconvex penalty is a significant advantage that indicates the region of smooth landscape of the optimization problem. Furthermore, we analytically show that the statistical representation performance of the SCAD penalty is better than that of \
Validity of the Wechsler Test of Adult Reading (WTAR): effort considered in a clinical sample of U.S. military veterans.

PubMed

Whitney, Kriscinda A; Shepard, Polly H; Mariner, Jennifer; Mossbarger, Brad; Herman, Steven M

2010-07-01

The current study represents an examination of the construct validity of the Wechsler Test of Adult Reading (WTAR) among a sample of U.S. military veterans referred for outpatient neuropsychological evaluation that included a measure of negative response bias, namely, the Test of Memory Malingering (TOMM). This retrospective data analysis examined the relationship between the WTAR and measures of current verbal general intellectual function and current cognitive skills. Findings showed that, among patients passing the TOMM (N = 98), WTAR scores were most highly correlated with current verbal IQ but also showed significant correlations with verbal memory and lesser, but still significant, correlations with measures of visual-spatial memory. Discriminant validity for the WTAR was also shown among the group passing the TOMM in the sense that the WTAR, which is designed to measure verbal premorbid general intellectual skill, was not as highly correlated with measures of learning and memory as was a measure of current verbal general intellectual skill. Whereas scores on most study measures did significantly differ between the groups that passed versus failed the TOMM (N = 26), scores on the WTAR did not, suggesting that the WTAR may remain robust even in the face of suboptimal effort.
In memory of Natacha Betz

NASA Astrophysics Data System (ADS)

Bouffard, Serge

2018-01-01

Natacha Betz has participated to the creation of the IRAP conference series and co-organized three of them. Unfortunately, ten years ago, she passed away. The organization of IRAP 2016 in France gives us the opportunity to pay a tribute to her memory.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.