Sample records for distributed-shared memory programming

  1. Supporting shared data structures on distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John

    1990-01-01

    Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the NCUBE/7 and iPSC/2 hypercubes is evaluated.
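
    To make the record's central idea concrete, here is a minimal sketch (illustrative only, not the paper's system; fetch_remote and all other names are hypothetical) of the owner-computes translation such an environment performs: a global index is mapped to an owning processor and a local offset, and non-local reads become messages.

        #include <cstddef>
        #include <utility>

        // Block distribution of n_global elements over n_procs processors:
        // processor p owns the contiguous range [p*b, (p+1)*b), b = ceil(N/P).
        struct BlockDistribution {
            std::size_t n_global, n_procs;
            std::size_t block() const { return (n_global + n_procs - 1) / n_procs; }
            std::pair<std::size_t, std::size_t> locate(std::size_t g) const {
                return { g / block(), g % block() };   // (owner, local offset)
            }
        };

        // Placeholder for the runtime's message-passing path: a real system
        // would send a request to `owner` and wait for the reply.
        double fetch_remote(std::size_t owner, std::size_t offset) {
            (void)owner; (void)offset;
            return 0.0;
        }

        // What a global-name-space compiler expands a read of a[g] into:
        // a local load when this rank owns the element, a message otherwise.
        double read_element(const BlockDistribution& d, std::size_t g,
                            std::size_t my_rank, const double* local_part) {
            auto [owner, offset] = d.locate(g);
            return owner == my_rank ? local_part[offset]
                                    : fetch_remote(owner, offset);
        }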

  2. Effects of cacheing on multitasking efficiency and programming strategy on an ELXSI 6400

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Montry, G.R.; Benner, R.E.

    1985-12-01

    The impact of a cache/shared memory architecture, and, in particular, the cache coherency problem, upon concurrent algorithm and program development is discussed. In this context, a simple set of programming strategies is proposed which streamlines code development and improves code performance when multitasking in a cache/shared memory or distributed memory environment.

  3. Shared versus distributed memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.

    1991-01-01

    The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation-intensive tasks, and research and implementation trends are considered. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors.

  4. Comparison of two paradigms for distributed shared memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Levelt, W.G.; Kaashoek, M.F.; Bal, H.E.

    1990-08-01

    The paper compares two paradigms for Distributed Shared Memory on loosely coupled computing systems: the shared data-object model as used in Orca, a programming language specially designed for loosely coupled computing systems, and the Shared Virtual Memory model. For both paradigms the authors have implemented two systems, one using only point-to-point messages, the other using broadcasting as well. They briefly describe these two paradigms and their implementations. Then they compare their performance on four applications: the traveling salesman problem, alpha-beta search, matrix multiplication, and the all pairs shortest paths problem. The measurements show that both paradigms can be used efficiently for programming large-grain parallel applications. Significant speedups were obtained on all applications. The unstructured Shared Virtual Memory paradigm achieves the best absolute performance, although this is largely due to the preliminary nature of the Orca compiler used. The structured shared data-object model achieves the highest speedups and is much easier to program and to debug.

  5. Combining Distributed and Shared Memory Models: Approach and Evolution of the Global Arrays Toolkit

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nieplocha, Jarek; Harrison, Robert J.; Kumar, Mukul

    2002-07-29

    Both shared memory and distributed memory models have advantages and shortcomings. The shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers, this characteristic can have a negative impact on performance and scalability. Various techniques, such as code restructuring to increase data reuse and introducing blocking in data accesses, can address the problem and yield performance competitive with message passing [Singh], however at the cost of compromising ease of use. Distributed memory models such as message passing or one-sided communication offer performance and scalability but compromise ease of use. In this context, the message-passing model is sometimes referred to as "assembly programming for scientific computing". The Global Arrays toolkit [GA1, GA2] attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed explicitly by the programmer. This management is achieved by explicit calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be explicitly specified and hence managed. The GA model exposes to the programmer the hierarchical memory of modern high-performance computer systems, and by recognizing the communication overhead for remote data transfer, it promotes data reuse and locality of reference. This paper describes the characteristics of the Global Arrays programming model, capabilities of the toolkit, and discusses its evolution.
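
    As a concrete illustration of the get/compute/put style this abstract describes, here is a minimal sketch using the Global Arrays C interface. The calls follow the documented GA API, but treat the exact signatures (and the MA_init scratch sizes) as indicative rather than authoritative.

        #include <mpi.h>
        #include "ga.h"        // Global Arrays C interface
        #include "macdecls.h"  // MA, the local memory allocator GA uses

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            GA_Initialize();                         // GA layered over MPI here
            MA_init(C_DBL, 1000000, 1000000);        // local scratch for GA

            int dims[2] = {1000, 1000};
            int g_a = NGA_Create(C_DBL, 2, dims, (char*)"A", nullptr);
            GA_Zero(g_a);                            // collective initialization

            // Fetch a patch of the distributed array into local storage...
            int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
            double buf[100];
            NGA_Get(g_a, lo, hi, buf, ld);

            // ...compute locally, then write the patch back explicitly.
            for (double& x : buf) x += 1.0;
            NGA_Put(g_a, lo, hi, buf, ld);

            GA_Sync();                               // collective completion point
            GA_Destroy(g_a);
            GA_Terminate();
            MPI_Finalize();
            return 0;
        }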

  6. Performance Modeling and Measurement of Parallelized Code for Distributed Shared Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry

    1998-01-01

    This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With the increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non-Uniform Memory Access (ccNUMA) architecture. We report measurement-based performance of these parallelized benchmarks from four perspectives: efficacy of the parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized versions of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives, but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
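
    The directive style evaluated here survives essentially unchanged in OpenMP; a minimal C++ analogue of parallelizing a sequential loop with a single compiler directive (illustrative, not the Fortran77/Origin2000 directives used in the paper) follows. On a DSM system, data placement still determines how well such a loop scales, which is the paper's "data locality overhead".

        #include <vector>

        // Sequential loop made parallel by one directive.
        // Compile with OpenMP enabled (e.g. -fopenmp).
        void daxpy(double a, const std::vector<double>& x,
                   std::vector<double>& y) {
            #pragma omp parallel for
            for (std::size_t i = 0; i < y.size(); ++i)
                y[i] += a * x[i];
        }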

  7. Address tracing for parallel machines

    NASA Technical Reports Server (NTRS)

    Stunkel, Craig B.; Janssens, Bob; Fuchs, W. Kent

    1991-01-01

    Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately.

  8. Programming distributed memory architectures using Kali

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Vanrosendale, John

    1990-01-01

    Programming nonshared memory systems is more difficult than programming shared memory systems, in part because of the relatively low level of current programming environments for such machines. A new programming environment is presented, Kali, which provides a global name space and allows direct access to remote data values. In order to retain efficiency, Kali provides a system of annotations, allowing the user to control those aspects of the program critical to performance, such as data distribution and load balancing. The primitives and constructs provided by the language are described, and some of the issues raised in translating a Kali program for execution on distributed memory systems are also discussed.

  9. Efficient iteration in data-parallel programs with irregular and dynamically distributed data structures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Littlefield, R.J.

    1990-02-01

    To implement an efficient data-parallel program on a non-shared memory MIMD multicomputer, data and computations must be properly partitioned to achieve good load balance and locality of reference. Programs with irregular data reference patterns often require irregular partitions. Although good partitions may be easy to determine, they can be difficult or impossible to implement in programming languages that provide only regular data distributions, such as blocked or cyclic arrays. We are developing Onyx, a programming system that provides a shared memory model of distributed data structures and extends the concept of data distribution to include irregular and dynamic distributions. This provides a powerful means to specify irregular partitions. Perhaps surprisingly, programs using it can also execute efficiently. In this paper, we describe and evaluate the Onyx implementation of a model problem that repeatedly executes an irregular but fixed data reference pattern. On an NCUBE hypercube, the speed of the Onyx implementation is comparable to that of carefully handwritten message-passing code.

  10. Memory access in shared virtual memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Berrendorf, R.

    1992-01-01

    Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.

  11. Memory access in shared virtual memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Berrendorf, R.

    1992-09-01

    Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.

  12. Vienna FORTRAN: A FORTRAN language extension for distributed memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

    1991-01-01

    Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.

  13. Programming in Vienna Fortran

    NASA Technical Reports Server (NTRS)

    Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

    1992-01-01

    Exploiting the full performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna Fortran is a language extension of Fortran which provides the user with a wide range of facilities for such mapping of data structures. In contrast to current programming practice, programs in Vienna Fortran are written using global data references. Thus, the user has the advantages of a shared memory programming paradigm while explicitly controlling the data distribution. In this paper, we present the language features of Vienna Fortran for FORTRAN 77, together with examples illustrating the use of these features.

  14. Parallel computing for probabilistic fatigue analysis

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.

    1993-01-01

    This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism, to keep large numbers of processors busy, and to treat problems with the large memory requirements encountered in practice. We also conclude that the distributed-memory architecture is preferable to shared memory for achieving large-scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.

  15. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Hood, Robert; Jost, Gabriele; Biegel, Bryan (Technical Monitor)

    2001-01-01

    This viewgraph presentation provides information on the technical aspects of debugging computer code that has been automatically converted for use in a parallel computing system. Shared memory parallelization and distributed memory parallelization entail separate and distinct challenges for a debugging program. A prototype system has been developed which integrates various tools for the debugging of automatically parallelized programs, including the CAPTools database, which provides variable definition information across subroutines as well as array distribution information.

  16. Global Arrays

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Krishnamoorthy, Sriram; Daily, Jeffrey A.; Vishnu, Abhinav

    2015-11-01

    Global Arrays (GA) is a distributed-memory programming model that allows for shared-memory-style programming combined with one-sided communication, to create a set of tools that combine high performance with ease of use. GA exposes a relatively straightforward programming abstraction, while supporting fully-distributed data structures, locality of reference, and high-performance communication. GA was originally formulated in the early 1990s to provide a communication layer for the Northwest Chemistry (NWChem) suite of chemistry modeling codes that was being developed concurrently.

  17. Parallel Programming Paradigms

    DTIC Science & Technology

    1987-07-01

    [Record text garbled in extraction. Recoverable fragments: distribution of this report is unlimited; supported in part by NSF Grant 8416878 and by Office of Naval Research Contracts No. N00014-86-K-0264 and No. N00014-85-K-0328; the dissertation observes that requiring multiple processors to fetch from the same memory cell (list head) seems to favor a shared memory implementation [37].]

  18. Parallelization of NAS Benchmarks for Shared Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on an SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow users to exploit parallelism. Native compilers on the SGI Origin2000 support multiprocessing directives that allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing the sequential implementation of NAS benchmarks. Results reported in this paper indicate that, with minimal effort, the performance is comparable with that of the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.

  19. UPC++ Programmer’s Guide (v1.0 2017.9)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bachan, J.; Baden, S.; Bonachea, D.

    UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

  20. UPC++ Programmer’s Guide, v1.0-2018.3.0

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bachan, J.; Baden, S.; Bonachea, Dan

    UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
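
    The two guide records above describe the same library; the short example below, written against the documented v1.0 API (consult the guides for authoritative signatures), shows the explicit, asynchronous remote access they mention: rput and rget return futures that the caller completes with wait().

        #include <upcxx/upcxx.hpp>
        #include <iostream>

        int main() {
            upcxx::init();
            int me = upcxx::rank_me();

            // Rank 0 allocates an array in its shared segment and broadcasts
            // the global pointer so every rank can address it.
            upcxx::global_ptr<double> data;
            if (me == 0) data = upcxx::new_array<double>(upcxx::rank_n());
            data = upcxx::broadcast(data, 0).wait();

            // Remote accesses are explicit and asynchronous.
            upcxx::rput(double(me), data + me).wait();
            upcxx::barrier();

            if (me == 0) {
                double v = upcxx::rget(data + upcxx::rank_n() - 1).wait();
                std::cout << "last element = " << v << "\n";
            }
            upcxx::finalize();
            return 0;
        }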

  1. What Multilevel Parallel Programs do when you are not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved that support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation-specific behavior of the operating system (OS) or architectural issues. Rewriting a large scientific application in order to employ a new programming paradigm is usually a time-consuming and error-prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse-grained parallelization and OpenMP [9] for fine-grained loop-level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP, which was developed by Taft. The approach is similar to MPI/OpenMP, using a mix of coarse-grain process-level parallelization and loop-level OpenMP parallelization. As is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
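
    A minimal sketch of the hybrid MPI/OpenMP pattern the study analyzes, MPI for coarse-grained decomposition across private address spaces and OpenMP for fine-grained loop parallelism within each process (illustrative code, not from the paper):

        #include <mpi.h>
        #include <vector>

        int main(int argc, char** argv) {
            // Request thread support so OpenMP threads can coexist with MPI.
            int provided;
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            // Coarse grain: each process owns a chunk of the domain.
            std::vector<double> chunk(1000000, rank);

            // Fine grain: loop-level parallelism inside the process.
            double local_sum = 0.0;
            #pragma omp parallel for reduction(+:local_sum)
            for (std::size_t i = 0; i < chunk.size(); ++i)
                local_sum += chunk[i];

            // Data crosses address spaces only via explicit messages.
            double global_sum = 0.0;
            MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                       0, MPI_COMM_WORLD);

            MPI_Finalize();
            return 0;
        }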

  2. Programming model for distributed intelligent systems

    NASA Technical Reports Server (NTRS)

    Sztipanovits, J.; Biegl, C.; Karsai, G.; Bogunovic, N.; Purves, B.; Williams, R.; Christiansen, T.

    1988-01-01

    A programming model and architecture which was developed for the design and implementation of complex, heterogeneous measurement and control systems is described. The Multigraph Architecture integrates artificial intelligence techniques with conventional software technologies, offers a unified framework for distributed and shared memory based parallel computational models and supports multiple programming paradigms. The system can be implemented on different hardware architectures and can be adapted to strongly different applications.

  3. Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Akil, Mohamed

    2017-05-01

    Real-time processing is getting more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks in image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool and a very data-intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper focuses on a survey of the approaches for parallel implementation of sequential watershed algorithms on multicore general purpose CPUs: homogeneous multicore processors with shared memory. To achieve an efficient parallel implementation, it is necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we give a comparison of various parallelizations of sequential watershed algorithms on shared memory multicore architecture. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare OpenMP (an application programming interface for multiprocessing) with Pthreads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.

  4. Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

    NASA Technical Reports Server (NTRS)

    Lawson, Gary; Poteat, Michael; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

    2016-01-01

    In this work, several mini-apps have been created to enhance the performance of a real-world application, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with the Message Passing Interface (MPI) for distributed memory accesses and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23X was measured for MPI+SMPI, but only 10X was measured for MPI+OpenMP.
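
    "Shared MPI" here plausibly refers to MPI-3 shared-memory windows, where ranks on the same node address one another's memory directly; the following is a minimal sketch of that mechanism (our assumption, not the paper's code):

        #include <mpi.h>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);

            // Group the ranks that can physically share memory (one node).
            MPI_Comm node;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node);
            int me, n;
            MPI_Comm_rank(node, &me);
            MPI_Comm_size(node, &n);

            // Each rank contributes one double to a node-wide shared window.
            double* mine;
            MPI_Win win;
            MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                                    MPI_INFO_NULL, node, &mine, &win);

            MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
            *mine = 100.0 + me;            // plain store, no message passing
            MPI_Win_sync(win);             // make the store visible
            MPI_Barrier(node);
            MPI_Win_sync(win);

            // Read the neighbor's slot through an ordinary pointer.
            MPI_Aint size; int disp; double* theirs;
            MPI_Win_shared_query(win, (me + 1) % n, &size, &disp, &theirs);
            double neighbor = *theirs;
            (void)neighbor;

            MPI_Win_unlock_all(win);
            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }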

  5. Execution time support for scientific programs on distributed memory machines

    NASA Technical Reports Server (NTRS)

    Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey

    1990-01-01

    Optimizations are considered that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communication patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
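
    Runtime systems of this kind follow an inspector/executor pattern: an "inspector" examines the global indices a loop will touch and derives a communication schedule, which an "executor" then replays for gathers and scatters. A simplified, self-contained sketch of the inspector step (illustrative names, not the actual PARTI API):

        #include <cstddef>
        #include <map>
        #include <vector>

        // Communication schedule: for each remote owner, the local offsets
        // of its array that this process needs. An executor would turn this
        // into matched send/receive messages, as PARTI does automatically.
        using Schedule = std::map<int, std::vector<std::size_t>>;

        Schedule build_schedule(const std::vector<std::size_t>& global_indices,
                                std::size_t block, int my_rank) {
            Schedule sched;
            for (std::size_t g : global_indices) {
                int owner = static_cast<int>(g / block);   // block distribution
                if (owner != my_rank)
                    sched[owner].push_back(g % block);     // remote offset
            }
            // Derived once at runtime, then reused while the (irregular but
            // fixed) reference pattern stays the same.
            return sched;
        }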

  6. Automated quantitative muscle biopsy analysis system

    NASA Technical Reports Server (NTRS)

    Castleman, Kenneth R. (Inventor)

    1980-01-01

    An automated system to aid the diagnosis of neuromuscular diseases by producing fiber size histograms utilizing histochemically stained muscle biopsy tissue. Televised images of the microscopic fibers are processed electronically by a multi-microprocessor computer, which isolates, measures, and classifies the fibers and displays the fiber size distribution. The architecture of the multi-microprocessor computer, which is iterated to any required degree of complexity, features a series of individual microprocessors P_n, each receiving data from a shared memory M_(n-1) and outputting processed data to a separate shared memory M_(n+1) under control of a program stored in dedicated memory M_n.

  7. Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study

    DOE PAGES

    Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...

    2015-01-01

    This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.

  8. Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2002-01-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0) preconditioned CG (PCG) using different programming paradigms and architectures. Results show that, for this class of applications, ordering significantly improves overall performance on both distributed and distributed shared-memory systems, that cache reuse may be more important than reducing communication, that it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution, and that a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gain. An implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread-level parallelism.

  9. Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++

    NASA Technical Reports Server (NTRS)

    Bodin, Francois; Priol, Thierry; Mehrotra, Piyush; Gannon, Dennis

    1994-01-01

    Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach, in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++, which is based on a concurrent aggregate parallel model.

  10. Implementations of BLAST for parallel computers.

    PubMed

    Jülich, A

    1995-02-01

    The BLAST sequence comparison programs have been ported to a variety of parallel computers: the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.

  11. Distributed simulation using a real-time shared memory network

    NASA Technical Reports Server (NTRS)

    Simon, Donald L.; Mattern, Duane L.; Wong, Edmond; Musgrave, Jeffrey L.

    1993-01-01

    The Advanced Control Technology Branch of the NASA Lewis Research Center performs research in the area of advanced digital controls for aeronautic and space propulsion systems. This work requires the real-time implementation of both control software and complex dynamical models of the propulsion system. We are implementing these systems in a distributed, multi-vendor computer environment. Therefore, a need exists for real-time communication and synchronization between the distributed multi-vendor computers. A shared memory network is a potential solution which offers several advantages over other real-time communication approaches. A candidate shared memory network was tested for basic performance. The shared memory network was then used to implement a distributed simulation of a ramjet engine. The accuracy and execution time of the distributed simulation were measured and compared to the performance of the non-partitioned simulation. The ease of partitioning the simulation, the minimal time required to develop communication between the processors, and the resulting execution time all indicate that the shared memory network is a real-time communication technique worthy of serious consideration.

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jones, J.P.; Bangs, A.L.; Butler, P.L.

    Hetero Helix is a programming environment which simulates shared memory on a heterogeneous network of distributed-memory computers. The machines in the network may vary with respect to their native operating systems and internal representation of numbers. Hetero Helix presents a simple programming model to developers, and also considers the needs of designers, system integrators, and maintainers. The key software technology underlying Hetero Helix is the use of a "compiler" which analyzes the data structures in shared memory and automatically generates code which translates data representations from the format native to each machine into a common format, and vice versa. The design of Hetero Helix was motivated in particular by the requirements of robotics applications. Hetero Helix has been used successfully in an integration effort involving 27 CPUs in a heterogeneous network and a body of software totaling roughly 100,000 lines of code. 25 refs., 6 figs.

  13. Tuning collective communication for Partitioned Global Address Space programming models

    DOE PAGES

    Nishtala, Rajesh; Zheng, Yili; Hargrove, Paul H.; ...

    2011-06-12

    Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with the locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language, programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this study we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and implementation. In particular, PGAS collectives have semantic issues that are different from those in send-receive style message passing programs, and different implementation approaches that take advantage of the one-sided communication style in these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory, and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. In conclusion, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and demonstrate that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with InfiniBand interconnect.

  14. Virtual memory support for distributed computing environments using a shared data object model

    NASA Astrophysics Data System (ADS)

    Huang, F.; Bacon, J.; Mapp, G.

    1995-12-01

    Conventional storage management systems provide one interface for accessing memory segments and another for accessing secondary storage objects. This hinders application programming and affects overall system performance due to mandatory data copying and user/kernel boundary crossings, which in the microkernel case may involve context switches. Memory-mapping techniques may be used to provide programmers with a unified view of the storage system. This paper extends such techniques to support a shared data object model for distributed computing environments in which good support for coherence and synchronization is essential. The approach is based on a microkernel, typed memory objects, and integrated coherence control. A microkernel architecture is used to support multiple coherence protocols and the addition of new protocols. Memory objects are typed, and applications can choose the most suitable protocols for different types of object to avoid protocol mismatch. Low-level coherence control is integrated with high-level concurrency control so that the number of messages required to maintain memory coherence is reduced and system-wide synchronization is realized without severely impacting the system performance. These features together constitute a novel approach to the support of flexible coherence under application control.
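
    The memory-mapping idea in the abstract, one interface for both memory and secondary storage, is the same mechanism exposed by POSIX mmap; a minimal sketch using standard POSIX calls (not the authors' microkernel interface):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main() {
            // Map a file so it is accessed like ordinary memory: loads and
            // stores replace explicit read()/write() calls on the object.
            int fd = open("data.bin", O_RDWR | O_CREAT, 0600);
            if (fd < 0) return 1;
            const size_t len = 4096;
            if (ftruncate(fd, len) != 0) return 1;

            void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) return 1;

            static_cast<char*>(p)[0] = 'x';  // a store, persisted to the file
            msync(p, len, MS_SYNC);          // force write-back (coherence point)
            munmap(p, len);
            close(fd);
            return 0;
        }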

  15. Cooperative Data Sharing: Simple Support for Clusters of SMP Nodes

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Balley, David H. (Technical Monitor)

    1997-01-01

    Libraries like PVM and MPI send typed messages to allow for heterogeneous cluster computing. Lower-level libraries, such as GAM, provide more efficient access to communication by removing the need to copy messages between the interface and user space in some cases. Still lower-level interfaces, such as UNET, get right down to the hardware level to provide maximum performance. However, these are all still interfaces for passing messages from one process to another, and have limited utility in a shared-memory environment, due primarily to the fact that message passing is just another term for copying. This drawback is made more pertinent by today's hybrid architectures (e.g. clusters of SMPs), where it is difficult to know beforehand whether two communicating processes will share memory. As a result, even portable language tools (like HPF compilers) must either map all interprocess communication into message passing, with the accompanying performance degradation in shared memory environments, or they must check each communication at run-time and implement the shared-memory case separately for efficiency. Cooperative Data Sharing (CDS) is a single user-level API which abstracts all communication between processes into the sharing and access coordination of memory regions, in a model which might be described as "distributed shared messages" or "large-grain distributed shared memory". As a result, the user programs to a simple latency-tolerant abstract communication specification which can be mapped efficiently to either a shared-memory or message-passing based run-time system, depending upon the available architecture. Unlike some distributed shared memory interfaces, the user still has complete control over the assignment of data to processors, the forwarding of data to its next likely destination, and the queuing of data until it is needed, so even the relatively high latency present in clusters can be accommodated. CDS does not require special use of an MMU, which can add overhead to some DSM systems, and does not require an SPMD programming model. Unlike some message-passing interfaces, CDS allows the user to implement efficient demand-driven applications where processes must "fight" over data, and does not perform copying if processes share memory and do not attempt concurrent writes. CDS also supports heterogeneous computing, dynamic process creation, handlers, and a very simple thread-arbitration mechanism. Additional support for array subsections is currently being considered. The CDS1 API, which forms the kernel of CDS, is built primarily upon only 2 communication primitives, one process initiation primitive, and some data translation (and marshalling) routines, memory allocation routines, and priority control routines. The entire current collection of 28 routines provides enough functionality to implement most (or all) of MPI 1 and 2, which has a much larger interface consisting of hundreds of routines. Still, the API is small enough to consider integrating into standard OS interfaces for handling inter-process communication in a network-independent way. This approach would also help to solve many of the problems plaguing other higher-level standards such as MPI and PVM, which must, in some cases, "play OS" to adequately address progress and process control issues. The CDS2 API, a higher level of interface roughly equivalent in functionality to MPI and to be built entirely upon CDS1, is still being designed.
It is intended to add support for the equivalent of communicators, reduction and other collective operations, process topologies, additional support for process creation, and some automatic memory management. CDS2 will not exactly match MPI, because the copy-free semantics of communication from CDS1 will be supported. CDS2 application programs will be free to carefully use CDS1 as well. CDS1 has been implemented on networks of workstations running unmodified Unix-based operating systems, using UDP/IP and vendor-supplied high-performance locks. Although its inter-node performance is currently unimpressive due to rudimentary implementation technique, it even now outperforms highly optimized MPI implementations on intra-node communication due to its support for non-copy communication. The similarity of the CDS1 architecture to that of other projects such as UNET and TRAP suggests that the inter-node performance can be increased significantly to surpass MPI or PVM, and it may be possible to migrate some of its functionality to communication controllers.

  16. Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress has been made in hardware and software technologies, the performance of parallel programs with compiler directives has demonstrated large improvements. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. Due to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based OpenMP parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS parallel benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.

  17. Implementation of a parallel unstructured Euler solver on shared and distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.

    1992-01-01

    An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.

  18. Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA-based implementations of selected CFD application benchmark kernels and compare them to corresponding message passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA-based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames; the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.
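
    One of the two RMA libraries compared is the MPI-2 one-sided extensions; a minimal sketch of that style, using fence synchronization (standard MPI-2 calls, not the paper's benchmark kernels):

        #include <mpi.h>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            // Expose a local buffer for remote access by all ranks.
            double buf = 42.0 + rank;
            MPI_Win win;
            MPI_Win_create(&buf, sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);

            // Fences open and close an access epoch; Get reads rank 0's
            // buffer with no matching call on rank 0 (the RMA model).
            double remote = 0.0;
            MPI_Win_fence(0, win);
            if (rank != 0)
                MPI_Get(&remote, 1, MPI_DOUBLE, /*target=*/0,
                        /*disp=*/0, 1, MPI_DOUBLE, win);
            MPI_Win_fence(0, win);

            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }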

  19. Scaling Irregular Applications through Data Aggregation and Software Multithreading

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morari, Alessandro; Tumeo, Antonino; Chavarría-Miranda, Daniel

    Bioinformatics, data analytics, semantic databases, and knowledge discovery are emerging high-performance application areas that exploit dynamic, linked data structures such as graphs, unbalanced trees, or unstructured grids. These data structures usually are very large, requiring significantly more memory than available on single shared memory systems. Additionally, these data structures are difficult to partition on distributed memory systems. They also present poor spatial and temporal locality, thus generating unpredictable memory and network accesses. The Partitioned Global Address Space (PGAS) programming model seems suitable for these applications, because it allows using a shared memory abstraction across distributed-memory clusters. However, current PGAS languages and libraries are built to target regular remote data accesses and block transfers. Furthermore, they usually rely on the Single Program Multiple Data (SPMD) parallel control model, which is not well suited to the fine-grained, dynamic, and unbalanced parallelism of irregular applications. In this paper we present GMT (Global Memory and Threading library), a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT integrates a PGAS data substrate with simple fork/join parallelism and provides automatic load balancing on a per-node basis. It implements multi-level aggregation and lightweight multithreading to maximize memory and network bandwidth with fine-grained data accesses and tolerate long data access latencies. A key innovation in the GMT runtime is its thread specialization (workers, helpers, and communication threads) that realizes the overall functionality. We compare our approach with other PGAS models, such as UPC running over GASNet, and with hand-optimized MPI code on a set of typical large-scale irregular applications, demonstrating speedups of an order of magnitude.

  20. Mnemonic transmission, social contagion, and emergence of collective memory: Influence of emotional valence, group structure, and information distribution.

    PubMed

    Choi, Hae-Yoon; Kensinger, Elizabeth A; Rajaram, Suparna

    2017-09-01

    Social transmission of memory and its consequence on collective memory have generated enduring interdisciplinary interest because of their widespread significance in interpersonal, sociocultural, and political arenas. We tested the influence of three key factors (emotional salience of information, group structure, and information distribution) on mnemonic transmission, social contagion, and collective memory. Participants individually studied emotionally salient (negative or positive) and nonemotional (neutral) picture-word pairs that were completely shared, partially shared, or unshared within participant triads, and then completed 3 consecutive recalls in 1 of 3 conditions: individual-individual-individual (control), collaborative-collaborative (identical group; insular structure)-individual, and collaborative-collaborative (reconfigured group; diverse structure)-individual. Collaboration enhanced negative memories, especially in the insular group structure and especially for shared information, and promoted collective forgetting of positive memories. Diverse group structure reduced this negativity effect. Unequally distributed information led to social contagion that creates false memories; diverse structure propagated a greater variety of false memories, whereas insular structure promoted confidence in false recognition and false collective memory. A simultaneous assessment of network structure, information distribution, and emotional valence breaks new ground to specify how network structure shapes the spread of negative memories and false memories, and the emergence of collective memory. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  1. Work stealing for GPU-accelerated parallel programs in a global address space framework: WORK STEALING ON GPU-ACCELERATED SYSTEMS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

    Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.

  2. Work stealing for GPU-accelerated parallel programs in a global address space framework

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

    Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.
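
    Both records describe the same work-stealing design; its core data structure is a per-worker task deque, with the owner pushing and popping at one end while idle workers steal from the other. A simplified, mutex-based sketch follows (illustrative only; production runtimes and the paper's GPU-aware design use lock-free deques and weigh data-movement costs):

        #include <deque>
        #include <functional>
        #include <mutex>
        #include <optional>

        using Task = std::function<void()>;

        class WorkStealingDeque {
            std::deque<Task> tasks_;
            std::mutex m_;
        public:
            // Owner side: LIFO, so recently spawned (cache-warm) tasks
            // run first on the worker that created them.
            void push(Task t) {
                std::lock_guard<std::mutex> g(m_);
                tasks_.push_back(std::move(t));
            }
            std::optional<Task> pop() {
                std::lock_guard<std::mutex> g(m_);
                if (tasks_.empty()) return std::nullopt;
                Task t = std::move(tasks_.back());
                tasks_.pop_back();
                return t;
            }
            // Thief side: steal the oldest task from the opposite end,
            // minimizing contention with the owner and tending to take
            // large unexplored subcomputations.
            std::optional<Task> steal() {
                std::lock_guard<std::mutex> g(m_);
                if (tasks_.empty()) return std::nullopt;
                Task t = std::move(tasks_.front());
                tasks_.pop_front();
                return t;
            }
        };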

  3. High Performance Programming Using Explicit Shared Memory Model on Cray T3D

    NASA Technical Reports Server (NTRS)

    Simon, Horst D.; Saini, Subhash; Grassi, Charles

    1994-01-01

    The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors and CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under the Cray Research adaptive Fortran (CRAFT) model, four programming methods (data parallel, work sharing, message passing using PVM, and the explicit shared memory model) are available to users. However, at this time the data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that neither standard PVM nor CRI's PVM exploits the hardware capabilities of the T3D. The reasons for the poor performance of PVM as a native message-passing library are presented, illustrated by the performance of the NAS Parallel Benchmarks (NPB) programmed in the explicit shared memory model on the Cray T3D. In general, the performance of standard PVM is about 4 to 5 times less than that obtained using the explicit shared memory model. A similar degradation is seen on the CM-5, where applications using the native message-passing library CMMD run about 4 to 5 times slower than those using data parallel methods. The issues involved in programming in the explicit shared memory model (such as barriers, synchronization, and invalidating and aligning the data cache) are discussed. Comparative performance of the NPB using the explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, and IBM SP1 is presented.
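
    The T3D's explicit shared memory model (one-sided remote loads and stores into another PE's memory) is the direct ancestor of today's OpenSHMEM interface; a minimal modern sketch of that style (OpenSHMEM calls, not the original T3D shmem library):

        #include <shmem.h>
        #include <cstdio>

        // Symmetric objects exist at the same address on every PE,
        // which is what makes one-sided put/get addressing possible.
        static double slot = 0.0;

        int main() {
            shmem_init();
            int me   = shmem_my_pe();
            int npes = shmem_n_pes();

            // One-sided store into the next PE's copy of `slot`:
            double val = 100.0 + me;
            shmem_double_put(&slot, &val, 1, (me + 1) % npes);

            shmem_barrier_all();      // order the puts before the reads
            std::printf("PE %d received %g\n", me, slot);

            shmem_finalize();
            return 0;
        }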

  4. Flexible language constructs for large parallel programs

    NASA Technical Reports Server (NTRS)

    Rosing, Matthew; Schnabel, Robert

    1993-01-01

    The goal of the research described is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (MIMD) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. An overview of a new language that combines many of these programming models in a clean manner is given. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. An overview of the language and a discussion of some of the critical implementation details are given.

  5. An OpenACC-Based Unified Programming Model for Multi-accelerator Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S

    2015-01-01

    This paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model whether they are in shared or distributed memory systems. We implement a prototype of our model and evaluate its performance with a GPU-based supercomputer using three benchmark applications.
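
    A common way to realize the SPMD multi-accelerator pattern this record describes is to bind each MPI rank to one device and offload its loop with OpenACC; a minimal sketch of that conventional pattern (not the paper's unified model itself):

        #include <mpi.h>
        #include <openacc.h>
        #include <vector>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            // One accelerator per rank: map rank -> device on this node.
            int ndev = acc_get_num_devices(acc_device_default);
            if (ndev > 0) acc_set_device_num(rank % ndev, acc_device_default);

            // Each rank offloads its share of the work to its own device.
            std::vector<double> v(1 << 20, 1.0);
            double* d = v.data();
            std::size_t n = v.size();
            #pragma acc parallel loop copy(d[0:n])
            for (std::size_t i = 0; i < n; ++i)
                d[i] *= 2.0;

            MPI_Finalize();
            return 0;
        }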

  6. Hypercluster Parallel Processor

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela

    1992-01-01

    Hypercluster computer system includes multiple digital processors, operation of which is coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.

  7. Static analysis of the hull plate using the finite element method

    NASA Astrophysics Data System (ADS)

    Ion, A.

    2015-11-01

    This paper presents the static analysis for two levels of a container ship's construction: the first level is at the girder / hull plate, and the second level is conducted on the entire strength hull of the vessel. This article describes the work for the static analysis of a hull plate, using the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 processors at 3.33 GHz and 32 GB of installed memory. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on an SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP or DMP systems.

  8. A portable approach for PIC on emerging architectures

    NASA Astrophysics Data System (ADS)

    Decyk, Viktor

    2016-03-01

    A portable approach for designing Particle-in-Cell (PIC) algorithms on emerging exascale computers is based on the recognition that three distinct programming paradigms are needed. They are: low level vector (SIMD) processing, middle level shared memory parallel programming, and high level distributed memory programming. In addition, there is a memory hierarchy associated with each level. Such algorithms can be initially developed using vectorizing compilers, OpenMP, and MPI. This is the approach recommended by Intel for the Phi processor. These algorithms can then be translated and possibly specialized to other programming models and languages, as needed. For example, the vector processing and shared memory programming might be done with CUDA instead of vectorizing compilers and OpenMP, but generally the algorithm itself is not greatly changed. The UCLA PICKSC web site at http://www.idre.ucla.edu/ contains example open source skeleton codes (mini-apps) illustrating each of these three programming models, individually and in combination. Fortran 2003 now supports abstract data types, and design patterns can be used to support a variety of implementations within the same code base. Fortran 2003 also supports interoperability with C, so that implementations in C languages are also easy to use. Finally, main codes can be translated into dynamic environments such as Python, while still taking advantage of high-performing compiled languages. Parallel languages are still evolving, with interesting developments in Co-array Fortran, UPC, and OpenACC, among others, and these can also be supported within the same software architecture. Work supported by NSF and DOE grants.
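
    A minimal sketch of the three-level pattern described above, with MPI for distributed memory, an OpenMP thread team for shared memory, and an inner loop the compiler can vectorize; this is a stand-in for illustration, not one of the PICKSC skeleton codes:

      #include <mpi.h>
      #include <stdio.h>

      #define N 1048576

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);              /* high level: distributed memory */
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          static double field[N];
          double local = 0.0;

          /* middle level: shared-memory threads; low level: SIMD inner loop */
          #pragma omp parallel for simd reduction(+:local)
          for (int i = 0; i < N; i++) {
              field[i] = 1.0e-3 * i;
              local += field[i];
          }

          double total;
          MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0) printf("sum = %g\n", total);
          MPI_Finalize();
          return 0;
      }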

  9. Flexible Language Constructs for Large Parallel Programs

    DOE PAGES

    Rosing, Matt; Schnabel, Robert

    1994-01-01

    The goal of the research described in this article is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (multiple instruction multiple data [MIMD]) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include single instruction multiple data (SIMD), single program multiple data (SPMD), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. In this article, we give an overview of a new language that combines many of these programming models in a clean manner. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. In this article, we give an overview of the language and discuss some of the critical implementation details.

  10. Parallel Computing for Probabilistic Response Analysis of High Temperature Composites

    NASA Technical Reports Server (NTRS)

    Sues, R. H.; Lua, Y. J.; Smith, M. D.

    1994-01-01

    The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit them. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and the solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.

  11. Support of Multidimensional Parallelism in the OpenMP Programming Model

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele

    2003-01-01

    OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often affect the scalability of applications. Examples of these limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test it with benchmark codes and a cloud modeling code.
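
    The proposed constructs themselves are not given in the abstract; for contrast, the closest standard-OpenMP mechanism for multi-dimensional work distribution is the collapse clause, sketched here:

      /* Baseline: standard OpenMP collapse(2) distributes the combined
       * 2-D iteration space; the paper proposes more flexible constructs. */
      #include <stdio.h>

      #define NX 512
      #define NY 512

      int main(void) {
          static double grid[NX][NY];

          #pragma omp parallel for collapse(2)
          for (int i = 0; i < NX; i++)
              for (int j = 0; j < NY; j++)
                  grid[i][j] = (double)(i + j);

          printf("grid[1][1] = %f\n", grid[1][1]);
          return 0;
      }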

  12. A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hoemmen, Mark

    2010-11-01

    Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.

  13. Partitioning problems in parallel, pipelined and distributed computing

    NASA Technical Reports Server (NTRS)

    Bokhari, S.

    1985-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.

  14. Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamics Codes

    NASA Technical Reports Server (NTRS)

    Yan, Jerry; Jin, Haoqiang; Frumkin, Michael; Yan, Jerry (Technical Monitor)

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress has been made in hardware and software technologies, the performance of parallel programs with compiler directives has demonstrated large improvements. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline techniques used in the implementation of the tool and discuss the application of this tool to the NAS Parallel Benchmarks and several computational fluid dynamics codes. This work demonstrates the great potential of using the tool to quickly parallelize serial programs and to achieve good performance that exceeds that of some of the commercial tools.
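
    As a hedged sketch of the kind of transformation such a tool automates (not actual CAPTools output), a loop whose running sum would naively look like a loop-carried dependence is annotated with a reduction clause once the analysis proves it safe:

      #include <stdio.h>
      #define N 100000

      int main(void) {
          static double b[N];
          for (int i = 0; i < N; i++) b[i] = 0.5 * i;

          double sum = 0.0;
          /* The toolkit's dependence analysis recognizes 'sum' as a
           * reduction and emits the directive automatically. */
          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < N; i++)
              sum += b[i];

          printf("sum = %f\n", sum);
          return 0;
      }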

  15. A message passing kernel for the hypercluster parallel processing test bed

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Quealy, Angela; Cole, Gary L.

    1989-01-01

    A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP.

  16. Proceedings: Sisal `93

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Feo, J.T.

    1993-10-01

    This report contains papers on: Programmability and performance issues; The case of an iterative partial differential equation solver; Implementing the kernel of the Australian Region Weather Prediction Model in Sisal; Even and quarter-even prime length symmetric FFTs and their Sisal implementations; Top-down thread generation for Sisal; Overlapping communications and computations on NUMA architectures; Compiling technique based on dataflow analysis for the functional programming language Valid; Copy elimination for true multidimensional arrays in Sisal 2.0; Increasing parallelism for an optimization that reduces copying in IF2 graphs; Caching in on Sisal; Cache performance of Sisal vs. FORTRAN; FFT algorithms on a shared-memory multiprocessor; A parallel implementation of nonnumeric search problems in Sisal; Computer vision algorithms in Sisal; Compilation of Sisal for a high-performance data driven vector processor; Sisal on distributed memory machines; A virtual shared addressing system for distributed memory Sisal; Developing a high-performance FFT algorithm in Sisal for a vector supercomputer; Implementation issues for IF2 on a static data-flow architecture; and Systematic control of parallelism in array-based data-flow computation. Selected papers have been indexed separately for inclusion in the Energy Science and Technology Database.

  17. Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

    NASA Technical Reports Server (NTRS)

    Lawson, Gary; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

    2017-01-01

    In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance the performance of a real-world application, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with the Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 11 for MPI+OpenMP.
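
    Assuming "Shared MPI" refers, as the name suggests, to MPI-3 shared-memory windows, the core idea can be sketched as follows: ranks on one node map a single allocation and access it with plain loads and stores instead of messages (a minimal sketch, error handling omitted):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);

          /* Group the ranks that share a node. */
          MPI_Comm node;
          MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                              MPI_INFO_NULL, &node);
          int nrank;
          MPI_Comm_rank(node, &nrank);

          /* Node rank 0 allocates; all node ranks map the same memory. */
          MPI_Win win;
          double *base;
          MPI_Aint size = (nrank == 0) ? 1024 * sizeof(double) : 0;
          MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                                  node, &base, &win);

          MPI_Aint qsize; int disp; double *shared;
          MPI_Win_shared_query(win, 0, &qsize, &disp, &shared);

          if (nrank == 0) shared[0] = 42.0;   /* plain store, no message */
          MPI_Win_fence(0, win);              /* synchronize node ranks  */
          printf("node rank %d sees %g\n", nrank, shared[0]);

          MPI_Win_free(&win);
          MPI_Comm_free(&node);
          MPI_Finalize();
          return 0;
      }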

  18. A simple modern correctness condition for a space-based high-performance multiprocessor

    NASA Technical Reports Server (NTRS)

    Probst, David K.; Li, Hon F.

    1992-01-01

    A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture, and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.

  19. Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Secchi, Simone; Tumeo, Antonino; Villa, Oreste

    Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has been proved to be a promising approach to achieve good accuracy in reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.

  20. The FORCE - A highly portable parallel programming language

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.

  1. The FORCE: A highly portable parallel programming language

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.

  2. Proceedings of the second SISAL users` conference

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Feo, J T; Frerking, C; Miller, P J

    1992-12-01

    This report contains papers on the following topics: A Sisal code for computing the Fourier transform on S_N; five ways to fill your knapsack; simulating material dislocation motion in Sisal; CANDIS as an interface for Sisal; parallelisation and performance of the Burg algorithm on a shared-memory multiprocessor; use of a genetic algorithm in Sisal to solve the file design problem; implementing FFTs in Sisal; programming and evaluating the performance of signal processing applications in the Sisal programming environment; Sisal and von Neumann-based languages: translation and intercommunication; an IF2 code generator for the ADAM architecture; program partitioning for NUMA multiprocessor computer systems; mapping functional parallelism on distributed memory machines; implicit array copying: prevention is better than cure; mathematical syntax for Sisal; an approach for optimizing recursive functions; implementing arrays in Sisal 2.0; Fol: an object oriented extension to the Sisal language; twine: a portable, extensible Sisal execution kernel; and investigating the memory performance of the optimizing Sisal compiler.

  3. Vascular system modeling in parallel environment - distributed and shared memory approaches

    PubMed Central

    Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

    2011-01-01

    The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891

  4. Discrete-Slots Models of Visual Working-Memory Response Times

    PubMed Central

    Donkin, Christopher; Nosofsky, Robert M.; Gold, Jason M.; Shiffrin, Richard M.

    2014-01-01

    Much recent research has aimed to establish whether visual working memory (WM) is better characterized by a limited number of discrete all-or-none slots or by a continuous sharing of memory resources. To date, however, researchers have not considered the response-time (RT) predictions of discrete-slots versus shared-resources models. To complement the past research in this field, we formalize a family of mixed-state, discrete-slots models for explaining choice and RTs in tasks of visual WM change detection. In the tasks under investigation, a small set of visual items is presented, followed by a test item in 1 of the studied positions for which a change judgment must be made. According to the models, if the studied item in that position is retained in 1 of the discrete slots, then a memory-based evidence-accumulation process determines the choice and the RT; if the studied item in that position is missing, then a guessing-based accumulation process operates. Observed RT distributions are therefore theorized to arise as probabilistic mixtures of the memory-based and guessing distributions. We formalize an analogous set of continuous shared-resources models. The model classes are tested on individual subjects with both qualitative contrasts and quantitative fits to RT-distribution data. The discrete-slots models provide much better qualitative and quantitative accounts of the RT and choice data than do the shared-resources models, although there is some evidence for “slots plus resources” when memory set size is very small. PMID:24015956

  5. NavP: Structured and Multithreaded Distributed Parallel Programming

    NASA Technical Reports Server (NTRS)

    Pan, Lei

    2007-01-01

    We present Navigational Programming (NavP) -- a distributed parallel programming methodology based on the principles of migrating computations and multithreading. The four major steps of NavP are: (1) Distribute the data using the data communication pattern in a given algorithm; (2) Insert navigational commands for the computation to migrate and follow large-sized distributed data; (3) Cut the sequential migrating thread and construct a mobile pipeline; and (4) Loop back for refinement. NavP is significantly different from the currently prevailing Message Passing (MP) approach. The advantages of NavP include: (1) NavP is structured distributed programming and does not change the code structure of the original algorithm. This is in sharp contrast to MP, as MP implementations in general do not resemble the original sequential code; (2) NavP implementations are always competitive with the best MPI implementations in terms of performance. In contrast, approaches such as DSM or HPF have so far failed to deliver satisfying performance, even though they are relatively easy to use compared to MP; (3) NavP provides incremental parallelization, which is beyond the reach of MP; and (4) NavP is a unifying approach that allows us to exploit both fine-grained (multithreading on shared memory) and coarse-grained (pipelined tasks on distributed memory) parallelism. This is in contrast to the currently popular hybrid use of MP+OpenMP, which is known to be complex to use. We present experimental results that demonstrate the effectiveness of NavP.

  6. An enhanced Ada run-time system for real-time embedded processors

    NASA Technical Reports Server (NTRS)

    Sims, J. T.

    1991-01-01

    An enhanced Ada run-time system has been developed to support real-time embedded processor applications. The primary focus of this development effort has been on the tasking system and the memory management facilities of the run-time system. The tasking system has been extended to support efficient and precise periodic task execution as required for control applications. Event-driven task execution providing a means of task-asynchronous control and communication among Ada tasks is supported in this system. Inter-task control is even provided among tasks distributed on separate physical processors. The memory management system has been enhanced to provide object allocation and protected access support for memory shared between disjoint processors, each of which is executing a distinct Ada program.

  7. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1995-01-01

    The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.

  8. Memory Network For Distributed Data Processors

    NASA Technical Reports Server (NTRS)

    Bolen, David; Jensen, Dean; Millard, ED; Robinson, Dave; Scanlon, George

    1992-01-01

    Universal Memory Network (UMN) is modular, digital data-communication system enabling computers with differing bus architectures to share 32-bit-wide data between locations up to 3 km apart with less than one millisecond of latency. Makes it possible to design sophisticated real-time and near-real-time data-processing systems without data-transfer "bottlenecks". This enterprise network permits transmission of volume of data equivalent to an encyclopedia each second. Facilities benefiting from Universal Memory Network include telemetry stations, simulation facilities, power-plants, and large laboratories or any facility sharing very large volumes of data. Main hub of UMN is reflection center including smaller hubs called Shared Memory Interfaces.

  9. Scheduling for Locality in Shared-Memory Multiprocessors

    DTIC Science & Technology

    1993-05-01

    [OCR residue from the DTIC report documentation page; only fragments of the abstract are recoverable: the thesis examines the impact of architecture trends on parallel program performance, explains the implications of these trends for popular parallel programming models, and proposes system software for decomposition and scheduling algorithms. Subject terms: shared-memory multiprocessors; architecture trends; loop scheduling.]

  10. Rapid solution of large-scale systems of equations

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.

    1994-01-01

    The analysis and design of complex aerospace structures requires the rapid solution of large systems of linear and nonlinear equations, eigenvalue extraction for buckling, vibration and flutter modes, structural optimization, and design sensitivity calculation. Computers with multiple processors and vector capabilities can offer substantial computational advantages over traditional scalar computers for these analyses. These computers fall into two categories: shared memory computers and distributed memory computers. This presentation covers general-purpose, highly efficient algorithms for generation/assembly of element matrices, solution of systems of linear and nonlinear equations, eigenvalue and design sensitivity analysis, and optimization. All algorithms are coded in FORTRAN for shared memory computers and many are adapted to distributed memory computers. The capability and numerical performance of these algorithms will be addressed.

  11. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sayan Ghosh, Jeff Hammond

    OpenSHMEM is a community effort to unify and standardize the SHMEM programming model. MPI (Message Passing Interface) is a well-known community standard for parallel programming using distributed memory. The most recent release of MPI, version 3.0, was designed in part to support programming models like SHMEM. OSHMPI is an implementation of the OpenSHMEM standard using MPI-3 for the Linux operating system. It is the first implementation of SHMEM over MPI one-sided communication and has the potential to be widely adopted due to the portability and wide availability of Linux and MPI-3. OSHMPI has been tested on a variety of systems and implementations of MPI-3, including InfiniBand clusters using MVAPICH2 and SGI shared-memory supercomputers using MPICH. Current support is limited to Linux but may be extended to Apple OSX if there is sufficient interest. The code is open source via https://github.com/jeffhammond/oshmpi
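
    The mapping OSHMPI performs can be suggested (illustration only, not OSHMPI source) by expressing a SHMEM-style one-sided put with MPI-3 RMA: every rank exposes a window standing in for the symmetric heap, and a put becomes MPI_Put plus a completion call:

      #include <mpi.h>
      #include <stdio.h>

      #define N 8

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int me, npes;
          MPI_Comm_rank(MPI_COMM_WORLD, &me);
          MPI_Comm_size(MPI_COMM_WORLD, &npes);

          /* Identically-sized buffer on every rank: a stand-in for the
           * SHMEM symmetric heap. */
          double heap[N] = {0};
          MPI_Win win;
          MPI_Win_create(heap, sizeof(heap), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &win);

          MPI_Win_lock_all(0, win);                 /* passive-target epoch */

          /* shmem_double_put(dst, src, 1, pe) analogue: */
          double val = 100.0 + me;
          MPI_Put(&val, 1, MPI_DOUBLE, (me + 1) % npes,
                  0, 1, MPI_DOUBLE, win);
          MPI_Win_flush((me + 1) % npes, win);      /* complete at target */

          MPI_Win_unlock_all(win);
          MPI_Barrier(MPI_COMM_WORLD);              /* shmem_barrier_all analogue */

          printf("PE %d heap[0] = %g\n", me, heap[0]);
          MPI_Win_free(&win);
          MPI_Finalize();
          return 0;
      }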

  12. Multiprocessor shared-memory information exchange

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Santoline, L.L.; Bowers, M.D.; Crew, A.W.

    1989-02-01

    In distributed microprocessor-based instrumentation and control systems, the inter- and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically, the protocol allows for multiple processors to exchange information via a shared-memory interface. The authors' primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, is designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.
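
    As a loose, hedged illustration of the underlying idea (a producer publishing a buffer to a consumer through shared memory with explicit ordering), a C11 single-writer/single-reader handoff might look like this; the actual MSMIE protocol is multi-buffer and deterministic rather than blocking:

      #include <stdatomic.h>
      #include <stdint.h>
      #include <string.h>

      typedef struct {
          _Atomic int ready;     /* 0 = empty, 1 = full */
          uint8_t payload[64];
      } shm_slot;

      /* Slave (producer): fill the buffer, then publish with a release store. */
      void slot_write(shm_slot *s, const uint8_t data[64]) {
          while (atomic_load_explicit(&s->ready, memory_order_acquire)) ;
          memcpy(s->payload, data, 64);
          atomic_store_explicit(&s->ready, 1, memory_order_release);
      }

      /* Master (consumer): wait for a full buffer, read it, mark it empty. */
      void slot_read(shm_slot *s, uint8_t out[64]) {
          while (!atomic_load_explicit(&s->ready, memory_order_acquire)) ;
          memcpy(out, s->payload, 64);
          atomic_store_explicit(&s->ready, 0, memory_order_release);
      }

    Unlike this busy-waiting sketch, MSMIE provides a minimum, deterministic cycle time, which is why the real protocol passes multiple buffers instead of blocking either side.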

  13. Hybrid MPI+OpenMP Programming of an Overset CFD Solver and Performance Investigations

    NASA Technical Reports Server (NTRS)

    Djomehri, M. Jahed; Jin, Haoqiang H.; Biegel, Bryan (Technical Monitor)

    2002-01-01

    This report describes a two level parallelization of a Computational Fluid Dynamic (CFD) solver with multi-zone overset structured grids. The approach is based on a hybrid MPI+OpenMP programming model suitable for shared memory and clusters of shared memory machines. The performance investigations of the hybrid application on an SGI Origin2000 (O2K) machine is reported using medium and large scale test problems.

  14. Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

    NASA Technical Reports Server (NTRS)

    Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta J.; Gimenez, J.; Caubet, J.

    2003-01-01

    Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.

  15. Shared prefetching to reduce execution skew in multi-threaded systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Eichenberger, Alexandre E; Gunnels, John A

    Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated based on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.
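
    A hedged user-level approximation of the idea (the patent describes a compiler transformation; GCC/Clang's __builtin_prefetch stands in here): each OpenMP thread prefetches only its own slice of a stream that the whole team then reads:

      #include <omp.h>
      #include <stdio.h>

      #define N (1 << 20)
      static double stream[N];

      int main(void) {
          double sum = 0.0;
          #pragma omp parallel reduction(+:sum)
          {
              int t  = omp_get_thread_num();
              int nt = omp_get_num_threads();

              /* Distribute the prefetch instructions: thread t touches its
               * slice once, so lines land in the shared cache once, not nt
               * times. (Remainder iterations ignored; these are hints only.) */
              for (int i = t * (N / nt); i < (t + 1) * (N / nt); i += 8)
                  __builtin_prefetch(&stream[i], 0, 3);

              #pragma omp barrier
              /* All threads then read the whole, now-cached, stream. */
              #pragma omp for
              for (int i = 0; i < N; i++)
                  sum += stream[i];
          }
          printf("sum = %g\n", sum);
          return 0;
      }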

  16. Technical support for digital systems technology development. Task order 1: ISP contention analysis and control

    NASA Technical Reports Server (NTRS)

    Stehle, Roy H.; Ogier, Richard G.

    1993-01-01

    Alternatives for realizing a packet-based network switch for use on a frequency division multiple access/time division multiplexed (FDMA/TDM) geostationary communication satellite were investigated. Each of the eight downlink beams supports eight directed dwells. The design needed to accommodate multicast packets with very low probability of loss due to contention. Three switch architectures were designed and analyzed. An output-queued, shared bus system yielded a functionally simple system, utilizing a first-in, first-out (FIFO) memory per downlink dwell, but at the expense of a large total memory requirement. A shared memory architecture offered the most efficiency in memory requirements, requiring about half the memory of the shared bus design. The processing requirement for the shared-memory system adds system complexity that may offset the benefits of the smaller memory. An alternative design using a shared memory buffer per downlink beam decreases circuit complexity through a distributed design, and requires at most 1000 packets of memory more than the completely shared memory design. Modifications to the basic packet switch designs were proposed to accommodate circuit-switched traffic, which must be served on a periodic basis with minimal delay. Methods for dynamically controlling the downlink dwell lengths were developed and analyzed. These methods adapt quickly to changing traffic demands, and do not add significant complexity or cost to the satellite and ground station designs. Methods for reducing the memory requirement by not requiring the satellite to store full packets were also proposed and analyzed. In addition, optimal packet and dwell lengths were computed as functions of memory size for the three switch architectures.

  17. PANDA: A distributed multiprocessor operating system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chubb, P.

    1989-01-01

    PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differences between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.

  18. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  19. Experiences using OpenMP based on Compiler Directed Software DSM on a PC Cluster

    NASA Technical Reports Server (NTRS)

    Hess, Matthias; Jost, Gabriele; Mueller, Matthias; Ruehle, Roland

    2003-01-01

    In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.

  20. Testing New Programming Paradigms with NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

    2000-01-01

    Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also in the increasing complexity of real applications. Technologies have been developed that aim at scaling to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. In recent years, new efforts have been made to define new parallel programming paradigms. The best examples are HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplifying the tedious tasks encountered in writing message passing programs. HPF is independent of the memory hierarchy; however, due to the immaturity of compiler technology, its performance is still questionable. Although the use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promises portability in a heterogeneous environment and offers the possibility to "compile once and run anywhere." To test these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage was applied to several benchmarks, notably BT and SP, resulting in better sequential performance. In order to overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as the lack of support for the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outermost loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size in the figure on the next page. These results were obtained on an SGI Origin2000 (195MHz) with the MIPSpro-f77 compiler 7.2.1 for the OpenMP and MPI codes and the PGI pghpf-2.4.3 compiler with MPI interface for the HPF programs.

  1. Memory T and memory B cells share a transcriptional program of self-renewal with long-term hematopoietic stem cells

    PubMed Central

    Luckey, Chance John; Bhattacharya, Deepta; Goldrath, Ananda W.; Weissman, Irving L.; Benoist, Christophe; Mathis, Diane

    2006-01-01

    The only cells of the hematopoietic system that undergo self-renewal for the lifetime of the organism are long-term hematopoietic stem cells and memory T and B cells. To determine whether there is a shared transcriptional program among these self-renewing populations, we first compared the gene-expression profiles of naïve, effector and memory CD8+ T cells with those of long-term hematopoietic stem cells, short-term hematopoietic stem cells, and lineage-committed progenitors. Transcripts augmented in memory CD8+ T cells relative to naïve and effector T cells were selectively enriched in long-term hematopoietic stem cells and were progressively lost in their short-term and lineage-committed counterparts. Furthermore, transcripts selectively decreased in memory CD8+ T cells were selectively down-regulated in long-term hematopoietic stem cells and progressively increased with differentiation. To confirm that this pattern was a general property of immunologic memory, we turned to independently generated gene expression profiles of memory, naïve, germinal center, and plasma B cells. Once again, memory-enriched and -depleted transcripts were also appropriately augmented and diminished in long-term hematopoietic stem cells, and their expression correlated with progressive loss of self-renewal function. Thus, there appears to be a common signature of both up- and down-regulated transcripts shared between memory T cells, memory B cells, and long-term hematopoietic stem cells. This signature was not consistently enriched in neural or embryonic stem cell populations and, therefore, appears to be restricted to the hematopoietic system. These observations provide evidence that the shared phenotype of self-renewal in the hematopoietic system is linked at the molecular level. PMID:16492737

  2. Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms

    NASA Technical Reports Server (NTRS)

    Djomehri, M. Jahed; Rizk, Yehia M.

    1999-01-01

    The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable on distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of a "Gauss-Seidel" fashion. The programming effort for the _MPI code is greater than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each group is approximately balanced. A proper number of threads is initially allocated to each group, and in subsequent iterations during the run-time, the number of threads is adjusted to achieve load balancing across the processes. Each process exploits the multitasking directives already established in OVERFLOW.
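
    The MLP idea, shared memory for all communication between coarse-grained processes, can be suggested with a minimal POSIX sketch (fork plus an anonymous shared mapping; the real NASA MLP library is Fortran-oriented and far more elaborate):

      #include <sys/mman.h>
      #include <sys/wait.h>
      #include <unistd.h>
      #include <stdio.h>

      int main(void) {
          /* Anonymous shared mapping, inherited across fork(). */
          double *shared = mmap(NULL, sizeof(double), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          if (shared == MAP_FAILED) return 1;
          *shared = 0.0;

          pid_t pid = fork();
          if (pid == 0) {             /* child "group" writes a boundary value */
              *shared = 3.14;
              _exit(0);
          }
          waitpid(pid, NULL, 0);      /* parent waits, then reads it directly */
          printf("parent sees %g\n", *shared);
          munmap(shared, sizeof(double));
          return 0;
      }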

  3. SKIRT: Hybrid parallelization of radiative transfer simulations

    NASA Astrophysics Data System (ADS)

    Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.

    2017-07-01

    We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
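
    The lock-free accumulation mentioned above can be illustrated in C11 (SKIRT itself is C++ with its own constructs; this is a sketch of the technique, not SKIRT code): threads deposit contributions into a shared cell with a compare-and-swap loop instead of a mutex:

      #include <stdatomic.h>
      #include <stdio.h>

      /* C11 has no atomic fetch-add for double, so emulate it with CAS. */
      static void atomic_add_double(_Atomic double *cell, double w) {
          double old = atomic_load_explicit(cell, memory_order_relaxed);
          while (!atomic_compare_exchange_weak_explicit(
                     cell, &old, old + w,
                     memory_order_relaxed, memory_order_relaxed))
              ;   /* 'old' is refreshed on failure; retry with the new value */
      }

      int main(void) {
          _Atomic double cell = 0.0;
          #pragma omp parallel for
          for (int i = 0; i < 1000; i++)
              atomic_add_double(&cell, 0.001);
          printf("cell = %f\n", (double)cell);
          return 0;
      }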

  4. Experiences Using OpenMP Based on Compiler Directed Software DSM on a PC Cluster

    NASA Technical Reports Server (NTRS)

    Hess, Matthias; Jost, Gabriele; Mueller, Matthias; Ruehle, Roland; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this work we report on our experiences running OpenMP (message passing) programs on a commodity cluster of PCs (personal computers) running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS (NASA Advanced Supercomputing) Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.

  5. System and method for programmable bank selection for banked memory subsystems

    DOEpatents

    Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Hoenicke, Dirk; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan

    2010-09-07

    A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each of the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures.
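
    A hedged software model of the mechanism (the patent describes hardware logic; the values here are invented for illustration): a per-processor mask/match table maps pre-determined physical-address bit patterns to a bank-select signal:

      #include <stdint.h>
      #include <stdio.h>

      typedef struct {
          uint64_t mask;    /* which physical address bits to examine */
          uint64_t match;   /* required values of those bits          */
          int      bank;    /* memory storage structure selected      */
      } bank_rule;

      /* Example programming: bit 12 steers references between two banks. */
      static const bank_rule rules[] = {
          { 0x1000u, 0x0000u, 0 },
          { 0x1000u, 0x1000u, 1 },
      };

      int select_bank(uint64_t paddr) {
          for (unsigned i = 0; i < sizeof rules / sizeof rules[0]; i++)
              if ((paddr & rules[i].mask) == rules[i].match)
                  return rules[i].bank;
          return 0;   /* default bank */
      }

      int main(void) {
          printf("0x2000 -> bank %d\n", select_bank(0x2000)); /* bit 12 clear */
          printf("0x3000 -> bank %d\n", select_bank(0x3000)); /* bit 12 set   */
          return 0;
      }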

  6. Distributed-Memory Fast Maximal Independent Set

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kanewala Appuhamilage, Thejaka Amila J.; Zalewski, Marcin J.; Lumsdaine, Andrew

    The Maximal Independent Set (MIS) graph problem arises in many applications such as computer vision, information theory, molecular biology, and process scheduling. The growing scale of MIS problems suggests the use of distributed-memory hardware as a cost-effective approach to providing necessary compute and memory resources. Luby proposed four randomized algorithms to solve the MIS problem. All those algorithms are designed focusing on shared-memory machines and are analyzed using the PRAM model. These algorithms do not have direct efficient distributed-memory implementations. In this paper, we extend two of Luby’s seminal MIS algorithms, “Luby(A)” and “Luby(B),” to distributed-memory execution, and we evaluate their performance. We compare our results with the “Filtered MIS” implementation in the Combinatorial BLAS library for two types of synthetic graph inputs.
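
    For orientation, a sequential sketch of the random-priority idea behind Luby(A): each round, active vertices draw random priorities, local winners join the MIS, and winners plus their neighbors drop out. The paper's contribution, the distributed-memory formulation, is not shown here.

      #include <stdio.h>
      #include <stdlib.h>

      #define NV 6
      /* Small fixed example graph as an adjacency matrix. */
      static const int adj[NV][NV] = {
          {0,1,0,0,1,0},{1,0,1,0,0,0},{0,1,0,1,0,0},
          {0,0,1,0,1,1},{1,0,0,1,0,0},{0,0,0,1,0,0},
      };

      int main(void) {
          int active[NV], in_mis[NV] = {0};
          double prio[NV];
          int remaining = NV;
          for (int v = 0; v < NV; v++) active[v] = 1;

          while (remaining > 0) {
              /* Each active vertex draws a random priority. */
              for (int v = 0; v < NV; v++)
                  if (active[v]) prio[v] = (double)rand() / RAND_MAX;

              /* A vertex enters the MIS if it beats every active neighbor. */
              for (int v = 0; v < NV; v++) {
                  if (!active[v]) continue;
                  int wins = 1;
                  for (int u = 0; u < NV; u++)
                      if (adj[v][u] && active[u] && prio[u] <= prio[v])
                          wins = 0;
                  if (wins) in_mis[v] = 1;
              }
              /* Winners and their neighbors leave the active set. */
              for (int v = 0; v < NV; v++) {
                  if (!(active[v] && in_mis[v])) continue;
                  active[v] = 0; remaining--;
                  for (int u = 0; u < NV; u++)
                      if (adj[v][u] && active[u]) { active[u] = 0; remaining--; }
              }
          }
          for (int v = 0; v < NV; v++)
              if (in_mis[v]) printf("vertex %d is in the MIS\n", v);
          return 0;
      }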

  7. Parallelization and automatic data distribution for nuclear reactor simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liebrock, L.M.

    1997-07-01

    Detailed attempts at realistic nuclear reactor simulations currently take many times real time to execute on high performance workstations. Even the fastest sequential machine can not run these simulations fast enough to ensure that the best corrective measure is used during a nuclear accident to prevent a minor malfunction from becoming a major catastrophe. Since sequential computers have nearly reached the speed of light barrier, these simulations will have to be run in parallel to make significant improvements in speed. In physical reactor plants, parallelism abounds. Fluids flow, controls change, and reactions occur in parallel with only adjacent components directly affecting each other. These do not occur in the sequentialized manner, with global instantaneous effects, that is often used in simulators. Development of parallel algorithms that more closely approximate the real-world operation of a reactor may, in addition to speeding up the simulations, actually improve the accuracy and reliability of the predictions generated. Three types of parallel architecture (shared memory machines, distributed memory multicomputers, and distributed networks) are briefly reviewed as targets for parallelization of nuclear reactor simulation. Various parallelization models (loop-based model, shared memory model, functional model, data parallel model, and a combined functional and data parallel model) are discussed along with their advantages and disadvantages for nuclear reactor simulation. A variety of tools are introduced for each of the models. Emphasis is placed on the data parallel model as the primary focus for two-phase flow simulation. Tools to support data parallel programming for multiple component applications and special parallelization considerations are also discussed.

  8. NAS Parallel Benchmark. Results 11-96: Performance Comparison of HPF and MPI Based NAS Parallel Benchmarks. 1.0

    NASA Technical Reports Server (NTRS)

    Saini, Subash; Bailey, David; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

    High Performance Fortran (HPF), the high-level language for parallel Fortran programming, is based on Fortran 90. HPF was defined by an informal standards committee known as the High Performance Fortran Forum (HPFF) in 1993, and modeled on TMC's CM Fortran language. Several HPF features have since been incorporated into the draft ANSI/ISO Fortran 95, the next formal revision of the Fortran standard. HPF allows users to write a single parallel program that can execute on a serial machine, a shared-memory parallel machine, or a distributed-memory parallel machine. HPF eliminates the complex, error-prone task of explicitly specifying how, where, and when to pass messages between processors on distributed-memory machines, or when to synchronize processors on shared-memory machines. HPF is designed in a way that allows the programmer to code an application at a high level, and then selectively optimize portions of the code by dropping into message passing or calling tuned library routines as 'extrinsics'. Compilers supporting High Performance Fortran features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR), Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/programming-model combinations (HPF and MPI (message passing interface)) will be compared, based on the latest NAS (NASA Advanced Supercomputing) Parallel Benchmark (NPB) results, thus providing a cross-machine and cross-model comparison. Specifically, HPF-based NPB results will be compared with MPI-based NPB results to provide perspective on the performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition, we also present NPB (Version 1.0) performance results for the following systems: DEC AlphaServer 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz), NEC SX-4/32, SGI/CRAY T3E, and SGI Origin2000.

  9. Frequent Statement and Dereference Elimination for Imperative and Object-Oriented Distributed Programs

    PubMed Central

    El-Zawawy, Mohamed A.

    2014-01-01

    This paper introduces new approaches for the analysis of frequent statement and dereference elimination for imperative and object-oriented distributed programs running on parallel machines equipped with hierarchical memories. The paper uses languages whose address spaces are globally partitioned. Distributed programs allow defining data layout and threads writing to and reading from other thread memories. Three type systems (for imperative distributed programs) are the tools of the proposed techniques. The first type system defines for every program point a set of calculated (ready) statements and memory accesses. The second type system uses an enriched version of the types of the first type system and determines which of the ready statements and memory accesses are used later in the program. The third type system uses the information gathered so far to eliminate unnecessary statement computations and memory accesses (the analysis of frequent statement and dereference elimination). Extensions to these type systems are also presented to cover object-oriented distributed programs. Two advantages of our work over related work are the following: the memory model used in this paper matches the hierarchical style of modern concurrent parallel computers, and in our approach each analysis result is assigned a type derivation that serves as a correctness proof. PMID:24892098

  10. The Automatic Parallelisation of Scientific Application Codes Using a Computer Aided Parallelisation Toolkit

    NASA Technical Reports Server (NTRS)

    Ierotheou, C.; Johnson, S.; Leggett, P.; Cross, M.; Evans, E.; Jin, Hao-Qiang; Frumkin, M.; Yan, J.; Biegel, Bryan (Technical Monitor)

    2001-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. Historically, the lack of a programming standard for using directives and the rather limited performance due to poor scalability have affected the take-up of this programming model. Significant progress has been made in hardware and software technologies, and as a result the performance of parallel programs with compiler directives has also improved. The introduction of an industrial standard for shared-memory programming with directives, OpenMP, has also addressed the issue of portability. In this study, we have extended the computer aided parallelization toolkit (developed at the University of Greenwich) to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline the way in which loop types are categorized and how efficient OpenMP directives can be defined and placed using the in-depth interprocedural analysis carried out by the toolkit. We also discuss the application of the toolkit to the NAS Parallel Benchmarks and a number of real-world application codes. This work not only demonstrates the great potential of using the toolkit to quickly parallelize serial programs but also the good performance achievable on up to 300 processors for hybrid message passing and directive-based parallelizations.
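
    As a rough illustration of the kind of transformation such a toolkit performs (this is not the toolkit's actual output), a serial loop nest whose iterations the interprocedural analysis has proven independent can receive an OpenMP directive like this:

        /* Jacobi-style relaxation: outer iterations are independent, so a
         * parallel-for directive may legally be placed on the i loop. The
         * loop variables declared in the for statements are private. */
        void relax(int n, double a[n][n], const double b[n][n]) {
            #pragma omp parallel for schedule(static)
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    a[i][j] = 0.25 * (b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]);
        }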

  11. Shared Memory Parallelization of an Implicit ADI-type CFD Code

    NASA Technical Reports Server (NTRS)

    Hauser, Th.; Huang, P. G.

    1999-01-01

    A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described, and performance measurements for the single- and multiprocessor implementations are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number Re(sub tau) = 180 has shown good agreement with existing data.
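
    A minimal skeleton of the hybrid MPI/OpenMP structure proposed for clusters of shared memory machines might look as follows; this is an illustrative sketch, not LESTool code:

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int provided, rank;
            /* MPI between nodes; FUNNELED: only the master thread calls MPI */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            #pragma omp parallel   /* OpenMP threads within each node */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());

            MPI_Finalize();
            return 0;
        }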

  12. Parallel Computation of the Regional Ocean Modeling System (ROMS)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wang, P; Song, Y T; Chao, Y

    2005-04-05

    The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds of processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.

  13. Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, Shuangshuang; Chen, Yousu; Wu, Di

    2015-12-09

    Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computationally intensive and challenging to solve using a single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on a shared-memory platform and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.
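
    The two styles can be contrasted on a generic per-element update step; the sketch below is illustrative only and assumes, for brevity, that the number of ranks divides n:

        #include <mpi.h>
        #include <omp.h>

        /* Shared memory: one process, the loop split across OpenMP threads. */
        void step_openmp(int n, double x[n], const double dx[n], double h) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++) x[i] += h * dx[i];
        }

        /* Distributed memory: each MPI rank updates its slice, then all
         * ranks exchange slices so every rank holds the full state vector. */
        void step_mpi(int n, double x[n], const double dx[n], double h,
                      MPI_Comm comm) {
            int rank, size;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);
            int chunk = n / size, lo = rank * chunk;
            for (int i = lo; i < lo + chunk; i++) x[i] += h * dx[i];
            MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                          x, chunk, MPI_DOUBLE, comm);
        }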

  14. Automation of Data Traffic Control on DSM Architecture

    NASA Technical Reports Server (NTRS)

    Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry

    2001-01-01

    The design of distributed shared memory (DSM) computers liberates users from the duty to distribute data across processors and allows for the incremental development of parallel programs using, for example, OpenMP or Java threads. DSM architecture greatly simplifies the development of parallel programs having good performance on a few processors. However, achieving good program scalability on DSM computers requires that the user understand data flow in the application and use various techniques to avoid data traffic congestion. In this paper we discuss a number of such techniques, including data blocking, data placement, data transposition, and page size control, and evaluate their efficiency on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks. We also present a tool which automates the detection of constructs causing data congestion in Fortran array-oriented codes and advises the user on code transformations for improving data traffic in the application.
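
    Two of these techniques are easy to sketch in C (illustrative code, not the paper's tool output): first-touch data placement, which exploits the fact that many DSM systems place a page on the node that first touches it, and loop blocking, which keeps tiles of the arrays resident in cache.

        #include <omp.h>

        /* Initialize with the same thread layout as the compute loops, so
         * pages land on the nodes that will later use them (first touch). */
        void first_touch_init(int n, double *a) {
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < n; i++) a[i] = 0.0;
        }

        /* Blocked transpose: bs-by-bs tiles of both arrays stay in cache. */
        void transpose_blocked(int n, int bs, const double *a, double *at) {
            for (int ii = 0; ii < n; ii += bs)
                for (int jj = 0; jj < n; jj += bs)
                    for (int i = ii; i < ii + bs && i < n; i++)
                        for (int j = jj; j < jj + bs && j < n; j++)
                            at[j * n + i] = a[i * n + j];
        }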

  15. [Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].

    PubMed

    Furuta, Takuya; Sato, Tatsuhiko

    2015-01-01

    Time-consuming Monte Carlo dose calculation has become feasible owing to the development of computer technology. However, the recent gains stem largely from the emergence of multi-core high performance computers, so parallel computing has become key to achieving good performance from software programs. The Monte Carlo simulation code PHITS contains two parallel computing functions: distributed-memory parallelization using protocols of the message passing interface (MPI), and shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose between the two functions according to their needs. This paper explains the two functions, with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.

  16. Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Probst, David K.

    1993-01-01

    A scalable solution to the memory-latency problem is necessary to keep the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from limiting performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.
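
    As a small concrete instance of latency tolerance by prefetching (a sketch only; __builtin_prefetch is a GCC/Clang extension, and the lookahead distance of 16 is an arbitrary assumption):

        /* Issue the request for a line needed 16 iterations from now, then
         * keep computing; the memory access overlaps the arithmetic. */
        double sum_with_prefetch(int n, const double *restrict a) {
            double s = 0.0;
            for (int i = 0; i < n; i++) {
                if (i + 16 < n)
                    __builtin_prefetch(&a[i + 16], /* read */ 0, /* locality */ 1);
                s += a[i];
            }
            return s;
        }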

  17. A numerical differentiation library exploiting parallel architectures

    NASA Astrophysics Data System (ADS)

    Voglis, C.; Hadjidoukas, P. E.; Lagaris, I. E.; Papageorgiou, D. G.

    2009-08-01

    We present a software library for numerically estimating first and second order partial derivatives of a function by finite differencing. Various truncation schemes are offered, resulting in corresponding formulas that are accurate to order O(h), O(h^2), and O(h^4), h being the differencing step. The derivatives are calculated via forward, backward and central differences. Care has been taken that only feasible points are used in the case where bound constraints are imposed on the variables. The Hessian may be approximated either from function or from gradient values. There are three versions of the software: a sequential version, an OpenMP version for shared memory architectures and an MPI version for distributed systems (clusters). The parallel versions exploit the multiprocessing capability offered by computer clusters, as well as modern multi-core systems, and due to the independent character of the derivative computation, the speedup scales almost linearly with the number of available processors/cores. Program summary. Program title: NDL (Numerical Differentiation Library). Catalogue identifier: AEDG_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDG_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 73 030. No. of bytes in distributed program, including test data, etc.: 630 876. Distribution format: tar.gz. Programming language: ANSI FORTRAN-77, ANSI C, MPI, OPENMP. Computer: Distributed systems (clusters), shared memory systems. Operating system: Linux, Solaris. Has the code been vectorised or parallelized?: Yes. RAM: The library uses O(N) internal storage, N being the dimension of the problem. Classification: 4.9, 4.14, 6.5. Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, etc. The parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Restrictions: The library uses only double precision arithmetic. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 15 ms for the serial distribution, 0.6 s for the OpenMP and 4.2 s for the MPI parallel distribution on 2 processors.
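
    A hedged sketch of the central-difference case (not the NDL source): each gradient component costs two function evaluations, and the step is chosen to balance truncation against round-off error.

        #include <math.h>
        #include <float.h>

        /* O(h^2)-accurate central-difference gradient. The step choice
         * h ~ cbrt(eps) * max(|x_i|, 1) is the classic balancing heuristic. */
        void grad_central(int n, double (*f)(const double *), double *x, double *g) {
            for (int i = 0; i < n; i++) {
                double xi = x[i];
                double h = cbrt(DBL_EPSILON) * fmax(fabs(xi), 1.0);
                x[i] = xi + h; double fp = f(x);
                x[i] = xi - h; double fm = f(x);
                x[i] = xi;
                g[i] = (fp - fm) / (2.0 * h);
            }
        }

    The 2n function evaluations are mutually independent, which is what lets the OpenMP and MPI versions scale almost linearly.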

  18. Explicit time integration of finite element models on a vectorized, concurrent computer with shared memory

    NASA Technical Reports Server (NTRS)

    Gilbertsen, Noreen D.; Belytschko, Ted

    1990-01-01

    The implementation of a nonlinear explicit program on a vectorized, concurrent computer with shared memory is described and studied. The conflict between vectorization and concurrency is described and some guidelines are given for optimal block sizes. Several example problems are summarized to illustrate the types of speed-ups which can be achieved by reprogramming as compared to compiler optimization.

  19. Multiprocessor architecture: Synthesis and evaluation

    NASA Technical Reports Server (NTRS)

    Standley, Hilda M.

    1990-01-01

    Multiprocessor computer architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application-specific architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty in analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization, covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward removing the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.

  20. Parallel discrete event simulation using shared memory

    NASA Technical Reports Server (NTRS)

    Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

    1988-01-01

    With traditional event-list techniques, evaluating a detailed discrete-event simulation model can often require hours or even days of computation time. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared-memory experiments, using the Chandy-Misra distributed-simulation algorithm to simulate networks of queues, is presented. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.

  1. Parallelization of KENO-Va Monte Carlo code

    NASA Astrophysics Data System (ADS)

    Ramón, Javier; Peña, Jorge

    1995-07-01

    KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code have been generated: one for shared memory machines and the other for distributed memory systems using the message-passing interface PVM. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. An FDDI network of 6 HP9000/735 workstations was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.

  2. Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel

    Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  3. Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    DOE PAGES

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel; ...

    2017-03-08

    Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  4. Execution time supports for adaptive scientific algorithms on distributed memory machines

    NASA Technical Reports Server (NTRS)

    Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey

    1990-01-01

    Optimizations are considered that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution-time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives an appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communication patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
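
    The inspector/executor pattern behind such primitives can be sketched as follows (illustrative C, not the actual PARTI interface):

        #include <string.h>

        /* Inspector: record which global indices the upcoming loop will
         * reference; a real inspector would deduplicate them and translate
         * each one into an (owning processor, local index) pair. */
        int inspect(int m, const int *ind, int *needed) {
            memcpy(needed, ind, (size_t)m * sizeof(int));
            return m;
        }

        /* Executor: one aggregated gather brings the needed values into a
         * local buffer, then the computation runs on local data only. */
        void execute(int m, const int *ind, const double *global,
                     double *local, double *y) {
            for (int k = 0; k < m; k++)
                local[k] = global[ind[k]];   /* stands in for the bulk gather */
            for (int k = 0; k < m; k++)
                y[k] += local[k];
        }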

  5. Location-Unbound Color-Shape Binding Representations in Visual Working Memory.

    PubMed

    Saiki, Jun

    2016-02-01

    The mechanism by which nonspatial features, such as color and shape, are bound in visual working memory, and the role of those features' location in their binding, remains unknown. In the current study, I modified a redundancy-gain paradigm to investigate these issues. A set of features was presented in a two-object memory display, followed by a single object probe. Participants judged whether the probe contained any features of the memory display, regardless of its location. Response time distributions revealed feature coactivation only when both features of a single object in the memory display appeared together in the probe, regardless of the response time benefit from the probe and memory objects sharing the same location. This finding suggests that a shared location is necessary in the formation of bound representations but unnecessary in their maintenance. Electroencephalography data showed that amplitude modulations reflecting location-unbound feature coactivation were different from those reflecting the location-sharing benefit, consistent with the behavioral finding that feature-location binding is unnecessary in the maintenance of color-shape binding.

  6. Conditional load and store in a shared memory

    DOEpatents

    Blumrich, Matthias A; Ohmacht, Martin

    2015-02-03

    A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
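
    A simplified software model of the mechanism (purely illustrative, ignoring the cache hierarchy and all concurrency control) might read:

        #include <stdbool.h>

        #define NPROC 4
        static long reservation[NPROC] = { -1, -1, -1, -1 };  /* per-processor */

        long load_reserve(int proc, long addr, const long *mem) {
            reservation[proc] = addr;        /* record address in the register */
            return mem[addr];
        }

        bool store_conditional(int proc, long addr, long val, long *mem) {
            if (reservation[proc] != addr)   /* reservation lost or never set */
                return false;
            mem[addr] = val;
            for (int p = 0; p < NPROC; p++)  /* a store kills other reservations */
                if (p != proc && reservation[p] == addr) reservation[p] = -1;
            reservation[proc] = -1;
            return true;
        }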

  7. Distributed shared memory for roaming large volumes.

    PubMed

    Castanié, Laurent; Mion, Christophe; Cavin, Xavier; Lévy, Bruno

    2006-01-01

    We present a cluster-based volume rendering system for roaming very large volumes. This system makes it possible to move a gigabyte-sized probe inside a total volume of several tens or hundreds of gigabytes in real time. While the size of the probe is limited by the total amount of texture memory on the cluster, the size of the total data set has no theoretical limit. The cluster is used as a distributed graphics processing unit that aggregates both graphics power and graphics memory. A hardware-accelerated volume renderer runs in parallel on the cluster nodes and the final image compositing is implemented using a pipelined sort-last rendering algorithm. Meanwhile, volume bricking and volume paging allow efficient data caching. On each rendering node, a distributed hierarchical cache system implements a global software-based distributed shared memory on the cluster. In case of a cache miss, this system first checks page residency on the other cluster nodes instead of directly accessing local disks. Using two Gigabit Ethernet network interfaces per node, we accelerate data fetching by a factor of 4 compared to directly accessing local disks. The system also implements asynchronous disk access and texture loading, which makes it possible to overlap data loading, volume slicing and rendering for optimal volume roaming.
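
    The cache-miss path can be summarized in a few lines; the sketch below fakes the three storage tiers with arrays so that it is self-contained, whereas in the real system they are the local cache, peer-node memory reached over the network, and local disk:

        #include <stdio.h>

        #define NB 8
        static double local_cache[NB]; static int in_local[NB];
        static double peer_memory[NB]; static int on_peer[NB];
        static double disk[NB] = { 0, 1, 2, 3, 4, 5, 6, 7 };

        double fetch_brick(int id) {
            if (in_local[id]) return local_cache[id];  /* local hit */
            double v = on_peer[id] ? peer_memory[id]   /* peer RAM: ~4x faster */
                                   : disk[id];         /* last resort: disk */
            local_cache[id] = v; in_local[id] = 1;     /* cache for reuse */
            return v;
        }

        int main(void) { printf("%g\n", fetch_brick(3)); return 0; }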

  8. Two demonstrators and a simulator for a sparse, distributed memory

    NASA Technical Reports Server (NTRS)

    Brown, Robert L.

    1987-01-01

    Described are two programs demonstrating different aspects of Kanerva's Sparse, Distributed Memory (SDM). These programs run on Sun 3 workstations, one using color, and have straightforward graphically oriented user interfaces and graphical output. Presented are descriptions of the programs, how to use them, and what they show. Additionally, this paper describes the software simulator behind each program.

  9. Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory

    NASA Technical Reports Server (NTRS)

    Janssens, Bob; Fuchs, W. Kent

    1994-01-01

    Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfer of blocks of data, resulting in reduced dependency tracking overhead and reduced potential for rollback propagation. We develop an ownership timestamp scheme to tolerate the loss of block state information and develop a passive server model of execution where interactions between processors are considered atomic. With our scheme, dependencies are significantly reduced compared to the traditional message-passing model.

  10. An experimental distributed microprocessor implementation with a shared memory communications and control medium

    NASA Technical Reports Server (NTRS)

    Mejzak, R. S.

    1980-01-01

    The distributed processing concept is defined in terms of control primitives, variables, and structures and their use in performing a decomposed discrete Fourier transform (DFT) application function. The design assumes interprocessor communications to be anonymous. In this scheme, all processors can access an entire common database by employing control primitives. Access to selected areas within the common database is random, enforced by a hardware lock, and determined by task and subtask pointers. This enables the number of processors to be varied in the configuration without any modifications to the control structure. Decompositional elements of the DFT application function in terms of tasks and subtasks are also described. The experimental hardware configuration consists of IMSAI 8080 chassis, which are independent 8-bit microcomputer units. These chassis are linked together to form a multiple processing system by means of a shared memory facility. This facility consists of hardware which provides a bus structure to enable up to six microcomputers to be interconnected. It provides polling and arbitration logic so that only one processor has access to shared memory at any one time.

  11. Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework (AthenaMP)

    NASA Astrophysics Data System (ADS)

    Calafiura, Paolo; Leggett, Charles; Seuster, Rolf; Tsulaia, Vakhtang; Van Gemmeren, Peter

    2015-12-01

    AthenaMP is a multi-process version of the ATLAS reconstruction, simulation and data analysis framework Athena. By leveraging Linux fork and copy-on-write mechanisms, it allows for sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain configurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows the running of AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of AthenaMP in the diversity of ATLAS event processing workloads on various computing resources: Grid, opportunistic resources and HPC.
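
    The fork/copy-on-write pattern with a shared event queue can be sketched in C (an illustration of the general mechanism, not AthenaMP code); here a pipe plays the role of the queue and two forked workers pull event tokens from it:

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/wait.h>

        int main(void) {
            int q[2], ev;
            pipe(q);                              /* the shared event queue */
            for (ev = 0; ev < 8; ev++)
                write(q[1], &ev, sizeof ev);      /* enqueue event tokens */
            close(q[1]);                          /* workers will see EOF */

            for (int w = 0; w < 2; w++)
                if (fork() == 0) {                /* pages shared copy-on-write */
                    while (read(q[0], &ev, sizeof ev) == sizeof ev)
                        printf("worker %d processes event %d\n", w, ev);
                    _exit(0);
                }

            close(q[0]);
            while (wait(NULL) > 0) {}             /* reap both workers */
            return 0;
        }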

  12. Parallel performance investigations of an unstructured mesh Navier-Stokes solver

    NASA Technical Reports Server (NTRS)

    Mavriplis, Dimitri J.

    2000-01-01

    A Reynolds-averaged Navier-Stokes solver based on unstructured mesh techniques for analysis of high-lift configurations is described. The method makes use of an agglomeration multigrid solver for convergence acceleration. Implicit line-smoothing is employed to relieve the stiffness associated with highly stretched meshes. A GMRES technique is also implemented to speed convergence at the expense of additional memory usage. The solver is cache efficient and fully vectorizable, and is parallelized using a two-level hybrid MPI-OpenMP implementation suitable for shared and/or distributed memory architectures, as well as clusters of shared memory machines. Convergence and scalability results are illustrated for various high-lift cases.

  13. Hierarchical resilience with lightweight threads.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wheeler, Kyle Bruce

    2011-10-01

    This paper proposes methodology for providing robustness and resilience for a highly threaded distributed- and shared-memory environment based on well-defined inputs and outputs to lightweight tasks. These inputs and outputs form a failure 'barrier', allowing tasks to be restarted or duplicated as necessary. These barriers must be expanded based on task behavior, such as communication between tasks, but do not prohibit any given behavior. One trend in high-performance computing codes is toward self-contained functions that mimic functional programming. Software designers are trending toward a model of software design where their core functions are specified in side-effect free or low-side-effect ways, wherein the inputs and outputs of the functions are well-defined. This provides the ability to copy the inputs to wherever they need to be, whether that's the other side of the PCI bus or the other side of the network, do work on that input using local memory, and then copy the outputs back (as needed). This design pattern is popular among new distributed threading environment designs. Such designs include the Barcelona STARS system, distributed OpenMP systems, the Habanero-C and Habanero-Java systems from Vivek Sarkar at Rice University, the HPX/ParalleX model from LSU, as well as our own Scalable Parallel Runtime effort (SPR) and the Trilinos stateless kernels. This design pattern is also shared by CUDA and several OpenMP extensions for GPU-type accelerators (e.g. the PGI OpenMP extensions).

  14. Implementation and performance of parallel Prolog interpreter

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wei, S.; Kale, L.V.; Balkrishna, R.

    1988-01-01

    In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE-OR process model, which exploits both AND and OR parallelism in logic programs. It is machine independent as it runs on top of the chare kernel, a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark programs on parallel machines including shared memory systems (an Alliant FX/8, a Sequent, and a MultiMax) and a non-shared memory system (an Intel iPSC/32 hypercube), in addition to its performance on a multiprocessor simulation system.

  15. Integrating Cache Performance Modeling and Tuning Support in Parallelization Tools

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1998-01-01

    With the resurgence of distributed shared memory (DSM) systems based on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures and the increasing disparity between memory and processor speeds, data locality overheads are becoming the greatest bottleneck in the way of realizing the potential high performance of these systems. While parallelization tools and compilers facilitate users in porting their sequential applications to a DSM system, a lot of time and effort is needed to tune the memory performance of these applications to achieve reasonable speedup. In this paper, we show that integrating cache performance modeling and tuning support within a parallelization environment can alleviate this problem. The Cache Performance Modeling and Prediction Tool (CPMP) employs trace-driven simulation techniques without the overhead of generating and managing detailed address traces. CPMP predicts the cache performance impact of source-code-level "what-if" modifications in a program to assist a user in the tuning process. CPMP is built on top of a customized version of the Computer Aided Parallelization Tools (CAPTools) environment. Finally, we demonstrate how CPMP can be applied to tune a real Computational Fluid Dynamics (CFD) application.

  16. Automatic selection of dynamic data partitioning schemes for distributed memory multicomputers

    NASA Technical Reports Server (NTRS)

    Palermo, Daniel J.; Banerjee, Prithviraj

    1995-01-01

    For distributed memory multicomputers such as the Intel Paragon, the IBM SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the data partitioning for a given application is crucial to obtaining high performance. This task has traditionally been the user's responsibility, but in recent years much effort has been directed to automating the selection of data partitioning schemes. Several researchers have proposed systems that are able to produce data distributions that remain in effect for the entire execution of an application. For complex programs, however, such static data distributions may be insufficient to obtain acceptable performance. The selection of distributions that dynamically change over the course of a program's execution adds another dimension to the data partitioning problem. In this paper, we present a technique that can be used to automatically determine which partitionings are most beneficial over specific sections of a program while taking into account the added overhead of performing redistribution. This system is being built as part of the PARADIGM (PARAllelizing compiler for DIstributed memory General-purpose Multicomputers) project at the University of Illinois. The complete system will provide a fully automated means to parallelize programs written in a serial programming model obtaining high performance on a wide range of distributed-memory multicomputers.

  17. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction

    DOE PAGES

    Kumar, B.; Huang, C. -H.; Sadayappan, P.; ...

    1995-01-01

    In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n x 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
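
    For reference, one level of Strassen's recursion written out for scalar 2x2 blocks shows the seven products that the tensor product formulation encodes; in the article these operations act on submatrices rather than scalars:

        /* Strassen's seven multiplications and four output combinations. */
        void strassen_2x2(const double A[2][2], const double B[2][2],
                          double C[2][2]) {
            double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
            double m2 = (A[1][0] + A[1][1]) *  B[0][0];
            double m3 =  A[0][0]            * (B[0][1] - B[1][1]);
            double m4 =  A[1][1]            * (B[1][0] - B[0][0]);
            double m5 = (A[0][0] + A[0][1]) *  B[1][1];
            double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
            double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
            C[0][0] = m1 + m4 - m5 + m7;
            C[0][1] = m3 + m5;
            C[1][0] = m2 + m4;
            C[1][1] = m1 - m2 + m3 + m6;
        }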

  18. Parallel discrete event simulation: A shared memory approach

    NASA Technical Reports Server (NTRS)

    Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

    1987-01-01

    With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to insure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.

  19. BIRD: A general interface for sparse distributed memory simulators

    NASA Technical Reports Server (NTRS)

    Rogers, David

    1990-01-01

    Kanerva's sparse distributed memory (SDM) has now been implemented for at least six different computers, including SUN3 workstations, the Apple Macintosh, and the Connection Machine. A common interface for input of commands would both aid testing of programs on a broad range of computer architectures and assist users in transferring results from research environments to applications. A common interface also allows secondary programs to generate command sequences for a sparse distributed memory, which may then be executed on the appropriate hardware. The BIRD program is an attempt to create such an interface. Simplifying access to different simulators should assist developers in finding appropriate uses for SDM.

  20. Common data buffer

    NASA Technical Reports Server (NTRS)

    Byrne, F.

    1981-01-01

    A time-shared interface speeds data processing in a distributed computer network. A two-level high-speed scanning approach routes information to the buffer, a portion of which is reserved for a series of "first-in, first-out" memory stacks. The buffer address structure and memory are protected from noise or failed components by an error-correcting code. The system is applicable to any computer or processing language.

  1. A Massively Parallel Code for Polarization Calculations

    NASA Astrophysics Data System (ADS)

    Akiyama, Shizuka; Höflich, Peter

    2001-03-01

    We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding, NLTE atmospheres for massively parallel computers which utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version which is based on the shared memory model. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed an improved scalability by about 40%.

  2. ORCA Project: Research on high-performance parallel computer programming environments. Final report, 1 Apr-31 Mar 90

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Snyder, L.; Notkin, D.; Adams, L.

    1990-03-31

    This task relates to research on programming massively parallel computers. Previous work on the Ensemble concept of programming was extended, and investigation into nonshared memory models of parallel computation was undertaken. Previous work on the Ensemble concept defined a set of programming abstractions and was used to organize the programming task into three distinct levels: composition of machine instructions, composition of processes, and composition of phases. It was applied to shared memory models of computation. During the present research period, these concepts were extended to nonshared memory models; one Ph.D. thesis was completed, and one book chapter and six conference papers were published.

  3. A Distributed Operating System for BMD Applications.

    DTIC Science & Technology

    1982-01-01

    Defense) applications executing on distributed hardware with local and shared memories. The objective was to develop real-time operating system functions...make the Basic Real-Time Operating System, and the set of new EPL language primitives that provide BMD application processes with efficient mechanisms

  4. Browndye: A software package for Brownian dynamics

    NASA Astrophysics Data System (ADS)

    Huber, Gary A.; McCammon, J. Andrew

    2010-11-01

    A new software package, Browndye, is presented for simulating the diffusional encounter of two large biological molecules. It can be used to estimate second-order rate constants and encounter probabilities, and to explore reaction trajectories. Browndye builds upon previous knowledge and algorithms from software packages such as UHBD, SDA, and Macrodox, while implementing algorithms that scale to larger systems. Program summary. Program title: Browndye. Catalogue identifier: AEGT_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGT_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: MIT license, included in distribution. No. of lines in distributed program, including test data, etc.: 143 618. No. of bytes in distributed program, including test data, etc.: 1 067 861. Distribution format: tar.gz. Programming language: C++, OCaml (http://caml.inria.fr/). Computer: PC, Workstation, Cluster. Operating system: Linux. Has the code been vectorised or parallelized?: Yes. Runs on multiple processors with shared memory using pthreads. RAM: Depends linearly on size of physical system. Classification: 3. External routines: uses the output of APBS [1] (http://www.poissonboltzmann.org/apbs/) as input. APBS must be obtained and installed separately. Expat 2.0.1, CLAPACK, ocaml-expat, and Mersenne Twister are included in the Browndye distribution. Nature of problem: Exploration and determination of rate constants of bimolecular interactions involving large biological molecules. Solution method: Brownian dynamics with electrostatic, excluded volume, van der Waals, and desolvation forces. Running time: Depends linearly on size of physical system and quadratically on precision of results. The included example executes in a few minutes.

  5. Parallel Navier-Stokes computations on shared and distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar

    1995-01-01

    We study a high order finite difference scheme to solve the time accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SP1). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.

  6. Ensuring correct rollback recovery in distributed shared memory systems

    NASA Technical Reports Server (NTRS)

    Janssens, Bob; Fuchs, W. Kent

    1995-01-01

    Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing method for DSM that takes advantage of DSM's specific properties to reduce error-free and rollback overhead. The scheme reduces the dependencies that need to be considered for correct rollback to those resulting from transfers of pages. Furthermore, in-transit messages can be recovered without the use of logging. We extend the scheme to a DSM implementation using lazy release consistency, where the frequency of dependencies is further reduced.

  7. Expressing Parallelism with ROOT

    NASA Astrophysics Data System (ADS)

    Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.

    2017-10-01

    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

  8. Expressing Parallelism with ROOT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Piparo, D.; Tejedor, E.; Guiraud, E.

    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. the multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

  9. Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel W.

    Coupled-cluster methods provide highly accurate models of molecular structure by explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the best optimized shared memory implementation. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  10. A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.

    PubMed

    Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang

    2018-01-01

    The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns for designing efficient algorithms in computer science, and it has been parallelized on both shared memory and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers develop their own efficient GPU implementations with less effort.

  11. SMT-Aware Instantaneous Footprint Optimization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roy, Probir; Liu, Xu; Song, Shuaiwen

    Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the whole memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging, because they usually spawn threads within Single Program Multiple Data (SPMD) models. To address this important issue, we introduce a simple scheme for SMT-aware code optimization, which aims to reduce the memory contention across SMT threads.

  12. Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Caubet, Jordi; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    In this paper we describe how to apply powerful performance analysis techniques to understand the behavior of multilevel parallel applications. We use the Paraver/OMPItrace performance analysis system for our study. This system consists of two major components: The OMPItrace dynamic instrumentation mechanism, which allows the tracing of processes and threads and the Paraver graphical user interface for inspection and analyses of the generated traces. We describe how to use the system to conduct a detailed comparative study of a benchmark code implemented in five different programming paradigms applicable for shared memory

  13. Dynamic programming on a shared-memory multiprocessor

    NASA Technical Reports Server (NTRS)

    Edmonds, Phil; Chu, Eleanor; George, Alan

    1993-01-01

    Three new algorithms for solving dynamic programming problems on a shared-memory parallel computer are described. All three algorithms attempt to balance the work load while keeping synchronization cost low. In particular, for a multiprocessor having p processors, an analysis of the best algorithm shows that the arithmetic cost is O(n^3/(6p)) and that the synchronization cost is O(|log_C n|) if p is much less than n, where C = (2p-1)/(2p+1) and n is the size of the problem. The low synchronization cost is important for machines where synchronization is expensive. Analysis and experiments show that the best algorithm is effective in balancing the work load and producing high efficiency.
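
    The authors' algorithms are not reproduced here, but a common shared-memory DP pattern conveys the work/synchronization trade-off: cells on the same anti-diagonal are independent, so each diagonal becomes a parallel loop with one barrier per diagonal (illustrative sketch; assumes row 0 and column 0 hold boundary values):

        #include <omp.h>

        void dp_wavefront(int n, double *t) {          /* t is n x n, row-major */
            for (int d = 2; d <= 2 * (n - 1); d++) {   /* anti-diagonal index */
                int lo = d - (n - 1) > 1 ? d - (n - 1) : 1;
                int hi = d - 1 < n - 1 ? d - 1 : n - 1;
                #pragma omp parallel for               /* implicit barrier = one sync */
                for (int i = lo; i <= hi; i++) {
                    int j = d - i;
                    t[i * n + j] = t[(i - 1) * n + j] + t[i * n + (j - 1)];
                }
            }
        }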

  14. Shared Versus Distributed Memory Multiprocessors

    DTIC Science & Technology

    1991-01-01

    The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention, with some researchers arguing strongly for building distributed memory machines. References cited include: Applications, MIT Press (1985); [6] D. Gajski et al., "Cedar," Proc. Compcon, pp. 306-309 (Spring 1989); [7] S. Ahuja, N. Carriero and D. Gelernter, "Linda"

  15. Satellite Image Mosaic Engine

    NASA Technical Reports Server (NTRS)

    Plesea, Lucian

    2006-01-01

    A computer program automatically builds large, full-resolution mosaics of multispectral images of Earth landmasses from images acquired by Landsat 7, complete with matching of colors and blending between adjacent scenes. While the code has been used extensively for Landsat, it could also be used for other data sources. A single mosaic of as many as 8,000 scenes, represented by more than 5 terabytes of data and the largest set produced in this work, demonstrated the code's ability to provide global coverage. The program first statistically analyzes input images to determine areas of coverage and data-value distributions. It then transforms the input images from their original universal transverse Mercator coordinates to other geographical coordinates, with scaling. It applies a first-order polynomial brightness correction to each band in each scene. It uses a data-mask image for selecting data and blending input scenes. Under control by a user, the program can be made to operate on small parts of the output image space, with checkpoint and restart capabilities. The program runs on SGI IRIX computers. It is capable of parallel processing using shared-memory code, large memories, and tens of central processing units. It can retrieve input data and store output data at locations remote from the processors on which it is executed.

  16. Parallel, Distributed Scripting with Python

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Miller, P J

    2002-05-24

    Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI library gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadmin tools such as password crackers, file purgers, etc. These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider a password checker that checks an encrypted password against a 25,000-word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.
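
    The password-checker example lends itself to a short sketch. The version below uses the modern mpi4py package (which postdates this report) and SHA-256 in place of crypt(3); the dictionary path and the target hash are illustrative assumptions. Run with, e.g., mpiexec -n 4 python crack.py.

        import hashlib
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        target = hashlib.sha256(b"hunter2").hexdigest()     # demo hash to crack

        chunks = None
        if rank == 0:
            with open("/usr/share/dict/words", "rb") as f:  # assumed word list
                words = [w.strip() for w in f]
            chunks = [words[i::size] for i in range(size)]  # round-robin split

        mine = comm.scatter(chunks, root=0)                 # distribute the work
        hit = next((w for w in mine
                    if hashlib.sha256(w).hexdigest() == target), None)
        hits = comm.gather(hit, root=0)                     # co-ordinate results
        if rank == 0:
            print("found:", [h for h in hits if h is not None])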

  17. Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shan, Hongzhang; Williams, Samuel; Zheng, Yili

    2015-10-26

    Structured-grid linear solvers often require manual packing and unpacking of communication data to achieve high performance. Orchestrating this process efficiently is challenging, labor-intensive, and potentially error-prone. In this paper, we explore an alternative approach that communicates the data with naturally grained message sizes without manual packing and unpacking. This approach is the distributed analogue of shared-memory programming, taking advantage of the global address space in PGAS languages to provide substantial programming ease. However, its performance may suffer from the large number of small messages. We investigate the runtime support required in the UPC++ library for this naturally grained version to close the performance gap between the two approaches and attain comparable performance at scale, using the High-Performance Geometric Multigrid (HPGMG-FV) benchmark as a driver.

  18. Parallel implementation of an adaptive and parameter-free N-body integrator

    NASA Astrophysics Data System (ADS)

    Pruett, C. David; Ingham, William H.; Herman, Ralph D.

    2011-05-01

    Previously, Pruett et al. (2003) [3] described an N-body integrator of arbitrarily high order M with an asymptotic operation count of O(MN). The algorithm's structure lends itself readily to data parallelization, which we document and demonstrate here in the integration of point-mass systems subject to Newtonian gravitation. High order is shown to benefit parallel efficiency. The resulting N-body integrator is robust, parameter-free, highly accurate, and adaptive in both time-step and order. Moreover, it exhibits linear speedup on distributed parallel processors, provided that each processor is assigned at least a handful of bodies.
    Program summary
    Program title: PNB.f90
    Catalogue identifier: AEIK_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIK_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html
    No. of lines in distributed program, including test data, etc.: 3052
    No. of bytes in distributed program, including test data, etc.: 68 600
    Distribution format: tar.gz
    Programming language: Fortran 90 and OpenMPI
    Computer: All shared or distributed memory parallel processors
    Operating system: Unix/Linux
    Has the code been vectorized or parallelized?: The code has been parallelized but has not been explicitly vectorized.
    RAM: Dependent upon N
    Classification: 4.3, 4.12, 6.5
    Nature of problem: High-accuracy numerical evaluation of trajectories of N point masses, each subject to Newtonian gravitation.
    Solution method: Parallel and adaptive extrapolation in time via power series of arbitrary degree.
    Running time: 5.1 s for the demo program supplied with the package.

  19. Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daily, Jeffrey A.; Lewis, Robert R.

    2011-11-30

    Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented, showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.
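
    The drop-in idea can be sketched in a few lines; the module path in the import is an assumption based on the GAiN name, so consult the Global Arrays distribution for the actual package layout.

        # import numpy as np              # the serial original
        import ga.gain as np              # hypothetical distributed drop-in

        a = np.ones((4096, 4096))         # array is partitioned across processes
        b = np.sin(a) + 2.0 * a           # each process works on its local blocks
        print(b.sum())                    # reductions communicate transparently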

  20. OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

    NASA Astrophysics Data System (ADS)

    Kimura, Keiji; Mase, Masayoshi; Mikami, Hiroki; Miyamoto, Takamichi; Shirako, Jun; Kasahara, Hironori

    OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in a METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restrictions for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with the OSCAR API can be compiled by ordinary OpenMP compilers, since the OSCAR API is designed on a subset of OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and RP2, a newly developed consumer electronics multicore chip by Renesas, Hitachi, and Waseda. The scalability evaluation shows that, on average, the OSCAR compiler with the OSCAR API achieves a speedup of 5.8 over sequential execution on the Power5+ workstation with eight cores and of 2.9 on RP2 with four cores. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler by up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.

  1. Efficient computation of aerodynamic influence coefficients for aeroelastic analysis on a transputer network

    NASA Technical Reports Server (NTRS)

    Janetzke, David C.; Murthy, Durbha V.

    1991-01-01

    Aeroelastic analysis is multi-disciplinary and computationally expensive; hence, it can greatly benefit from parallel processing. As part of an effort to develop an aeroelastic capability on a distributed memory transputer network, a parallel algorithm for the computation of aerodynamic influence coefficients is implemented on a network of 32 transputers. The aerodynamic influence coefficients are calculated using a 3-D unsteady aerodynamic model and a parallel discretization. Efficiencies up to 85 percent were demonstrated using 32 processors. The effects of subtask ordering, problem size, and network topology are presented. A comparison to results on a shared memory computer indicates that higher speedup is achieved on the distributed memory system.

  2. Merlin - Massively parallel heterogeneous computing

    NASA Technical Reports Server (NTRS)

    Wittie, Larry; Maples, Creve

    1989-01-01

    Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.

  3. Distributed memory compiler design for sparse problems

    NASA Technical Reports Server (NTRS)

    Wu, Janet; Saltz, Joel; Berryman, Harry; Hiranandani, Seema

    1991-01-01

    A compiler and runtime support mechanism is described and demonstrated. The methods presented are capable of solving a wide range of sparse and unstructured problems in scientific computing. The compiler takes as input a FORTRAN 77 program enhanced with specifications for distributing data, and outputs a message passing program that runs on a distributed memory computer. The runtime support for this compiler is a library of primitives designed to efficiently support irregular patterns of distributed array accesses and irregular distributed array partitions. A variety of Intel iPSC/860 performance results obtained through the use of this compiler are presented.

  4. Crystallographic and general use programs for the XDS Sigma 5 computer

    NASA Technical Reports Server (NTRS)

    Snyder, R. L.

    1973-01-01

    Programs in basic FORTRAN 4 are described, which fall into three categories: (1) interactive programs to be executed under time sharing (BTM); (2) noninteractive programs which are executed in batch processing mode (BPM); and (3) large noninteractive programs which require more memory than is available in the normal BPM/BTM operating system and must be run overnight on a special system called XRAY, which releases about 45,000 words of memory to the user. Programs in categories (1) and (2) are stored as FORTRAN source files in the account FSNYDER. Programs in category (3) are stored in the XRAY system as load modules. The type of file in account FSNYDER is identified by the first two letters in the name.

  5. C++QEDv2: The multi-array concept and compile-time algorithms in the definition of composite quantum systems

    NASA Astrophysics Data System (ADS)

    Vukics, András

    2012-06-01

    C++QED is a versatile framework for simulating open quantum dynamics. It allows arbitrarily complex quantum systems to be built from elementary free subsystems and interactions, and their time evolution to be simulated with the available time-evolution drivers. Through this framework, we introduce a design which should be generic for high-level representations of composite quantum systems. It relies heavily on the object-oriented and generic programming paradigms on the one hand, and on compile-time algorithms, in particular C++ template-metaprogramming techniques, on the other. The core of the design is the data structure which represents the state vectors of composite quantum systems. This data structure models the multi-array concept. The use of template metaprogramming is not only crucial to the design; with its use, all computations pertaining to the layout of the simulated system can be shifted to compile time, hence reducing runtime.
    Program summary
    Program title: C++QED
    Catalogue identifier: AELU_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AELU_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: http://cpc.cs.qub.ac.uk/licence/aelu_v1_0.html. The C++QED package contains other software packages, Blitz, Boost and FLENS, all of which may be distributed freely but have individual license requirements. Please see individual packages for license conditions.
    No. of lines in distributed program, including test data, etc.: 597 974
    No. of bytes in distributed program, including test data, etc.: 4 874 839
    Distribution format: tar.gz
    Programming language: C++
    Computer: i386-i686, x86_64
    Operating system: In principle cross-platform; as yet tested only on UNIX-like systems (including Mac OS X).
    RAM: The framework itself takes about 60 MB, which is fully shared. The additional memory taken by the program which defines the actual physical system (script) is typically less than 1 MB. The memory storing the actual data scales with the system dimension for state-vector manipulations, and with the square of the dimension for density-operator manipulations. This may easily reach GBs, and often the memory of the machine limits the size of the simulated system.
    Classification: 4.3, 4.13, 6.2, 20
    External routines: Boost C++ libraries (http://www.boost.org/), GNU Scientific Library (http://www.gnu.org/software/gsl/), Blitz++ (http://www.oonumerics.org/blitz/), Linear Algebra Package - Flexible Library for Efficient Numerical Solutions (http://flens.sourceforge.net/)
    Nature of problem: Definition of (open) composite quantum systems out of elementary building blocks [1]. Manipulation of such systems, with emphasis on dynamical simulations such as Master-equation evolution [2] and Monte Carlo wave-function simulation [3].
    Solution method: Master equation, Monte Carlo wave-function method.
    Restrictions: Total dimensionality of the system. Master equation: a few thousand. Monte Carlo wave-function trajectory: several million.
    Unusual features: Because of the heavy use of compile-time algorithms, compilation of programs written in the framework may take a long time and much memory (up to several GBs).
    Additional comments: The framework is not a program, but provides and implements an application-programming interface for developing simulations in the indicated problem domain.
    Supplementary information: http://cppqed.sourceforge.net/
    Running time: Depending on the magnitude of the problem, can vary from a few seconds to weeks.

  6. Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Heber, Gerd; Biswas, Rupak

    2000-01-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
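
    For reference, the sketch below shows a CSR-format SPMV kernel of the kind that dominates each CG iteration; the ordering and partitioning strategies studied in the paper amount to permuting rows and columns so that the indirect accesses x[indices[k]] hit cache or local memory more often. This is a generic illustration, not the paper's code.

        import numpy as np

        def spmv_csr(indptr, indices, data, x):
            y = np.empty(len(indptr) - 1, dtype=data.dtype)
            for i in range(len(y)):                     # rows are independent:
                lo, hi = indptr[i], indptr[i + 1]       # parallelize across them
                y[i] = data[lo:hi] @ x[indices[lo:hi]]  # gather on column indices
            return y

        # 3x3 example: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
        indptr = np.array([0, 2, 3, 5])
        indices = np.array([0, 2, 1, 0, 2])
        data = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
        print(spmv_csr(indptr, indices, data, np.ones(3)))  # [3. 3. 9.]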

  7. Multiple-User, Multitasking, Virtual-Memory Computer System

    NASA Technical Reports Server (NTRS)

    Generazio, Edward R.; Roth, Don J.; Stang, David B.

    1993-01-01

    Computer system designed and programmed to serve multiple users in research laboratory. Provides for computer control and monitoring of laboratory instruments, acquisition and analysis of data from those instruments, and interaction with users via remote terminals. System provides fast access to shared central processing units and associated large (from megabytes to gigabytes) memories. Underlying concept of system also applicable to monitoring and control of industrial processes.

  8. A multiarchitecture parallel-processing development environment

    NASA Technical Reports Server (NTRS)

    Townsend, Scott; Blech, Richard; Cole, Gary

    1993-01-01

    A description is given of the hardware and software of a multiprocessor test bed - the second generation Hypercluster system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multiprocessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing concepts such as message passing algorithms, debugging tools, and computational 'steering'. Such research would be difficult, if not impossible, to achieve on shared, commercial systems.

  9. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sewell, Christopher Meyer

    This is a set of slides from a guest lecture for a class at the University of Texas at El Paso on visualization and data analysis for high-performance computing. The topics covered are: trends in high-performance computing; scientific visualization, including OpenGL, ray tracing and volume rendering, VTK, and ParaView; and data science at scale, including in-situ visualization, image databases, distributed memory parallelism, shared memory parallelism, VTK-m, "big data", and an analysis example.

  10. PIPS-SBB: A Parallel Distributed-Memory Branch-and-Bound Algorithm for Stochastic Mixed-Integer Programs

    DOE PAGES

    Munguia, Lluis-Miquel; Oxberry, Geoffrey; Rajan, Deepak

    2016-05-01

    Stochastic mixed-integer programs (SMIPs) deal with optimization under uncertainty at many levels of the decision-making process. When solved as extensive-formulation mixed-integer programs, problem instances can exceed available memory on a single workstation. In order to overcome this limitation, we present PIPS-SBB: a distributed-memory parallel stochastic MIP solver that takes advantage of parallelism at multiple levels of the optimization process. We also show promising results on the SIPLIB benchmark by combining methods known for accelerating Branch and Bound (B&B) methods with new ideas that leverage the structure of SMIPs. Finally, we expect the performance of PIPS-SBB to improve further as more functionality is added in the future.

  11. Memory operation mechanism of fullerene-containing polymer memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nakajima, Anri, E-mail: anakajima@hiroshima-u.ac.jp; Fujii, Daiki

    2015-03-09

    The memory operation mechanism in fullerene-containing nanocomposite gate insulators was investigated while varying the kind of fullerene in a polymer gate insulator. It was clarified what kinds of traps, and at which positions in the nanocomposite, the injected electrons or holes are stored. The reason for the difference in the ease of programming was clarified by taking into account the charging energy of an injected electron. The dependence of the carrier dynamics on the kind of fullerene molecule was investigated. A nonuniform distribution of injected carriers occurred after application of a large-magnitude programming voltage, due to the width distribution of the polystyrene barrier between adjacent fullerene molecules. Through these investigations, we demonstrated a nanocomposite gate with fullerene molecules having excellent retention characteristics and programming capability. This will lead to the realization of practical organic memories with fullerene-containing polymer nanocomposites.

  12. Single Event Upset Analysis: On-orbit performance of the Alpha Magnetic Spectrometer Digital Signal Processor Memory aboard the International Space Station

    NASA Astrophysics Data System (ADS)

    Li, Jiaqiang; Choutko, Vitaly; Xiao, Liyi

    2018-03-01

    Based on the collection of error data from the Alpha Magnetic Spectrometer (AMS) Digital Signal Processors (DSP), on-orbit Single Event Upsets (SEUs) of the DSP program memory are analyzed. The daily error distribution and the time intervals between errors are calculated to evaluate the reliability of the system. The particle density distribution of the International Space Station (ISS) orbit is presented, and the effects of the South Atlantic Anomaly (SAA) and the geomagnetic poles are analyzed. The impact of solar events on the DSP program memory is assessed by combining data analysis with Monte Carlo (MC) simulation. From the analysis and simulation results, it is concluded that the area corresponding to the SAA is the main source of errors on the ISS orbit. Solar events can also cause errors in DSP program memory, but the effect depends on the on-orbit particle density.

  13. Performing an allreduce operation using shared memory

    DOEpatents

    Archer, Charles J [Rochester, MN; Dozsa, Gabor [Ardsley, NY; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

    2012-04-17

    Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
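
    A loose sketch of the claimed decomposition, using Python's multiprocessing.shared_memory rather than the patented implementation: the reduction is split into independent shared-memory work units, and each available worker claims the next unit from a shared queue (standing in for the job status object) until all units are done.

        import numpy as np
        from multiprocessing import Process, Queue, shared_memory

        N, UNIT = 1 << 20, 1 << 14          # vector length, work-unit size
        NUNITS = N // UNIT

        def worker(jobs, in_name, out_name):
            shm_i = shared_memory.SharedMemory(name=in_name)
            shm_o = shared_memory.SharedMemory(name=out_name)
            src = np.ndarray((N,), np.float64, buffer=shm_i.buf)
            part = np.ndarray((NUNITS,), np.float64, buffer=shm_o.buf)
            while (u := jobs.get()) is not None:        # claim next work unit
                part[u] = src[u * UNIT:(u + 1) * UNIT].sum()
            del src, part                               # drop views, then detach
            shm_i.close(); shm_o.close()

        if __name__ == "__main__":
            shm_i = shared_memory.SharedMemory(create=True, size=N * 8)
            shm_o = shared_memory.SharedMemory(create=True, size=NUNITS * 8)
            np.ndarray((N,), np.float64, buffer=shm_i.buf)[:] = 1.0
            jobs = Queue()
            for u in range(NUNITS):
                jobs.put(u)                             # enqueue all work units
            workers = [Process(target=worker, args=(jobs, shm_i.name, shm_o.name))
                       for _ in range(4)]
            for w in workers:
                jobs.put(None)                          # one stop signal each
            for w in workers: w.start()
            for w in workers: w.join()
            part = np.ndarray((NUNITS,), np.float64, buffer=shm_o.buf)
            print(part.sum() == float(N))               # True: N ones reduce to N
            del part
            shm_i.close(); shm_i.unlink(); shm_o.close(); shm_o.unlink()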

  14. Performing an allreduce operation using shared memory

    DOEpatents

    Archer, Charles J; Dozsa, Gabor; Ratterman, Joseph D; Smith, Brian E

    2014-06-10

    Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.

  15. Simulation Analysis of Data Sharing in Shared Memory Multiprocessors

    DTIC Science & Technology

    1989-02-24

    Parallel programs used in the study were contributed by Andrea Casotto (CELL), Steve McGrogan (SPICE), Srinivas Devadas (TOPOP1) and Hi-Keung Tony Ma (VERIFY). Figures in the report include the effect of block size on bus utilization, the ratio of sharing bus cycles to total bus cycles, and a classification of bus cycles.

  16. Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chin, George; Marquez, Andres; Choudhury, Sutanay

    2012-09-01

    Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code's data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.

  17. Sensorimotor memory of object weight distribution during multidigit grasp.

    PubMed

    Albert, Frederic; Santello, Marco; Gordon, Andrew M

    2009-10-09

    We studied the ability to transfer three-digit force sharing patterns learned through consecutive lifts of an object with an asymmetric center of mass (CM). After several object lifts, we asked subjects to rotate and translate the object to the contralateral hand and perform one additional lift. This task was performed under two weight conditions (550 and 950 g) to determine the extent to which subjects would be able to transfer weight and CM information. Learning transfer was quantified by measuring the extent to which force sharing patterns and peak object roll on the first post-translation trial resembled those measured on the pre-translation trial with the same CM. We found that the overall gain of fingertip forces was transferred following object rotation, but that the scaling of individual digit forces was specific to the learned digit-object configuration, and thus was not transferred following rotation. As a result, on the first post-translation trial there was a significantly larger object roll following object lift-off than on the pre-translation trial. This suggests that sensorimotor memories for weight, requiring scaling of fingertip force gain, may differ from memories for mass distribution.

  18. The force on the flex: Global parallelism and portability

    NASA Technical Reports Server (NTRS)

    Jordan, H. F.

    1986-01-01

    A parallel programming methodology, called the force, supports the construction of programs to be executed in parallel by an unspecified, but potentially large, number of processes. The methodology was originally developed on a pipelined, shared memory multiprocessor, the Denelcor HEP, and embodies the primitive operations of the force in a set of macros which expand into multiprocessor Fortran code. A small set of primitives is sufficient to write large parallel programs, and the system has been used to produce 10,000-line programs in computational fluid dynamics. The level of complexity of the force primitives is intermediate: it is high enough to mask detailed architectural differences between multiprocessors but low enough to give the user control over performance. The system is being ported to a medium scale multiprocessor, the Flex/32, which is a 20-processor system with a mixture of shared and local memory. Memory organization and the type of processor synchronization supported by the hardware on the two machines lead to some differences in efficient implementations of the force primitives, but the user interface remains the same. An initial implementation was done by retargeting the macros to Flexible Computer Corporation's ConCurrent C language. Subsequently, the macros were changed to directly produce the system calls which form the basis for ConCurrent C. The implementation of the Fortran-based system is in step with Flexible Computer Corporation's implementation of a Fortran system in the parallel environment.

  19. OFF, Open source Finite volume Fluid dynamics code: A free, high-order solver based on parallel, modular, object-oriented Fortran API

    NASA Astrophysics Data System (ADS)

    Zaghi, S.

    2014-07-01

    OFF, an open source (free software) code for performing fluid dynamics simulations, is presented. The aim of OFF is to solve, numerically, the unsteady (and steady) compressible Navier-Stokes equations of fluid dynamics by means of finite volume techniques: the research background is mainly focused on high-order (WENO) schemes for multi-fluid, multi-phase flows over complex geometries. To this purpose a highly modular, object-oriented application program interface (API) has been developed. In particular, the concepts of data encapsulation and inheritance available within the Fortran language (from standard 2003) have been exploited in order to represent each fluid dynamics "entity" (e.g. the conservative variables of a finite volume, its geometry, etc.) by a single object, so that a large variety of computational libraries can be easily (and efficiently) developed upon these objects. The main features of OFF can be summarized as follows:
    Programming language: OFF is written in standard (compliant) Fortran 2003; its design is highly modular in order to enhance simplicity of use and maintenance without compromising efficiency.
    Parallel frameworks supported: the development of OFF has also targeted maximum computational efficiency; the code is designed to run on shared-memory multi-core workstations and distributed-memory clusters of shared-memory nodes (supercomputers), with parallelization based on the Open Multiprocessing (OpenMP) and Message Passing Interface (MPI) paradigms.
    Usability, maintenance and enhancement: the documentation is built upon comprehensive comments placed directly into the source files (no external documentation files needed); these comments are parsed by the doxygen free software, producing high-quality html and latex documentation pages; the distributed version-control system git has been adopted in order to facilitate collaborative maintenance and improvement of the code.
    Copyrights: OFF is free software that anyone can use, copy, distribute, study, change and improve under the GNU Public License version 3.
    The present paper is a manifesto of the OFF code and presents the currently implemented features and ongoing developments. This work is focused on the computational techniques adopted, and a detailed description of the main API characteristics is reported. OFF capabilities are demonstrated by means of one- and two-dimensional examples and a three-dimensional real application.

  20. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    NASA Astrophysics Data System (ADS)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

  1. Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

    NASA Technical Reports Server (NTRS)

    Quealy, Angela; Cole, Gary L.; Blech, Richard A.

    1993-01-01

    The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.

  2. MPF: A portable message passing facility for shared memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.

    1987-01-01

    The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications: linear systems solution and iterative solution of partial differential equations.

  3. Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures

    DTIC Science & Technology

    2017-10-04

    This report from the University of North Carolina at Chapel Hill, "Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures," describes algorithms for scientific and geometric computing that exploit the power and performance efficiency of heterogeneous shared memory architectures.

  4. Distributed Systems Technology Survey.

    DTIC Science & Technology

    1987-03-01

    ... and protocols. 2. Hardware Technology: Economic factors are a major reason for the proliferation of distributed systems; processors, memory, and magnetic and optical storage ... destined messages and perform the appropriate forwarding. There is ... agreement that a lightweight process mechanism is essential to support commonly used ... Xerox PARC environment [31]. Shared file servers, discussed below, are essential to the success of such a scheme. 11. Security: A distributed ...

  5. Data traffic reduction schemes for sparse Cholesky factorizations

    NASA Technical Reports Server (NTRS)

    Naik, Vijay K.; Patrick, Merrell L.

    1988-01-01

    Load distribution schemes are presented which minimize the total data traffic in the Cholesky factorization of dense and sparse, symmetric, positive definite matrices on multiprocessor systems with local and shared memory. The total data traffic in factoring an n x n sparse, symmetric, positive definite matrix representing an n-vertex regular 2-D grid graph using n^alpha (alpha <= 1) processors is shown to be O(n^(1+alpha/2)). It is O(n^(3/2)) when n^alpha (alpha >= 1) processors are used. Under the conditions of uniform load distribution, these results are shown to be asymptotically optimal. The schemes allow efficient use of up to O(n) processors before the total data traffic reaches the maximum value of O(n^(3/2)). The partitioning employed within the scheme allows a better utilization of the data accessed from shared memory than that of previously published methods.

  6. Working Memory and Decision-Making in a Frontoparietal Circuit Model

    PubMed Central

    2017-01-01

    Working memory (WM) and decision-making (DM) are fundamental cognitive functions involving a distributed interacting network of brain areas, with the posterior parietal cortex (PPC) and prefrontal cortex (PFC) at the core. However, the shared and distinct roles of these areas and the nature of their coordination in cognitive function remain poorly understood. Biophysically based computational models of cortical circuits have provided insights into the mechanisms supporting these functions, yet they have primarily focused on the local microcircuit level, raising questions about the principles for distributed cognitive computation in multiregional networks. To examine these issues, we developed a distributed circuit model of two reciprocally interacting modules representing PPC and PFC circuits. The circuit architecture includes hierarchical differences in local recurrent structure and implements reciprocal long-range projections. This parsimonious model captures a range of behavioral and neuronal features of frontoparietal circuits across multiple WM and DM paradigms. In the context of WM, both areas exhibit persistent activity, but, in response to intervening distractors, PPC transiently encodes distractors while PFC filters distractors and supports WM robustness. With regard to DM, the PPC module generates graded representations of accumulated evidence supporting target selection, while the PFC module generates more categorical responses related to action or choice. These findings suggest computational principles for distributed, hierarchical processing in cortex during cognitive function and provide a framework for extension to multiregional models. SIGNIFICANCE STATEMENT Working memory and decision-making are fundamental “building blocks” of cognition, and deficits in these functions are associated with neuropsychiatric disorders such as schizophrenia. These cognitive functions engage distributed networks with prefrontal cortex (PFC) and posterior parietal cortex (PPC) at the core. It is not clear, however, what the contributions of PPC and PFC are in light of the computations that subserve working memory and decision-making. We constructed a biophysical model of a reciprocally connected frontoparietal circuit that revealed shared and distinct functions for the PFC and PPC across working memory and decision-making tasks. Our parsimonious model connects circuit-level properties to cognitive functions and suggests novel design principles beyond those of local circuits for cognitive processing in multiregional brain networks. PMID:29114071

  7. Working Memory and Decision-Making in a Frontoparietal Circuit Model.

    PubMed

    Murray, John D; Jaramillo, Jorge; Wang, Xiao-Jing

    2017-12-13

    Working memory (WM) and decision-making (DM) are fundamental cognitive functions involving a distributed interacting network of brain areas, with the posterior parietal cortex (PPC) and prefrontal cortex (PFC) at the core. However, the shared and distinct roles of these areas and the nature of their coordination in cognitive function remain poorly understood. Biophysically based computational models of cortical circuits have provided insights into the mechanisms supporting these functions, yet they have primarily focused on the local microcircuit level, raising questions about the principles for distributed cognitive computation in multiregional networks. To examine these issues, we developed a distributed circuit model of two reciprocally interacting modules representing PPC and PFC circuits. The circuit architecture includes hierarchical differences in local recurrent structure and implements reciprocal long-range projections. This parsimonious model captures a range of behavioral and neuronal features of frontoparietal circuits across multiple WM and DM paradigms. In the context of WM, both areas exhibit persistent activity, but, in response to intervening distractors, PPC transiently encodes distractors while PFC filters distractors and supports WM robustness. With regard to DM, the PPC module generates graded representations of accumulated evidence supporting target selection, while the PFC module generates more categorical responses related to action or choice. These findings suggest computational principles for distributed, hierarchical processing in cortex during cognitive function and provide a framework for extension to multiregional models. SIGNIFICANCE STATEMENT Working memory and decision-making are fundamental "building blocks" of cognition, and deficits in these functions are associated with neuropsychiatric disorders such as schizophrenia. These cognitive functions engage distributed networks with prefrontal cortex (PFC) and posterior parietal cortex (PPC) at the core. It is not clear, however, what the contributions of PPC and PFC are in light of the computations that subserve working memory and decision-making. We constructed a biophysical model of a reciprocally connected frontoparietal circuit that revealed shared and distinct functions for the PFC and PPC across working memory and decision-making tasks. Our parsimonious model connects circuit-level properties to cognitive functions and suggests novel design principles beyond those of local circuits for cognitive processing in multiregional brain networks. Copyright © 2017 the authors 0270-6474/17/3712167-20$15.00/0.

  8. The effect of the order in which episodic autobiographical memories versus autobiographical knowledge are shared on feelings of closeness.

    PubMed

    Brandon, Nicole R; Beike, Denise R; Cole, Holly E

    2017-07-01

    Autobiographical memories (AMs) can be used to create and maintain closeness with others [Alea, N., & Bluck, S. (2003). Why are you telling me that? A conceptual model of the social function of autobiographical memory. Memory, 11(2), 165-178]. However, the differential effects of memory specificity are not well established. Two studies with 148 participants tested whether the order in which autobiographical knowledge (AK) and specific episodic AM (EAM) are shared affects feelings of closeness. Participants read two memories hypothetically shared by each of four strangers. The strangers first shared either AK or an EAM, and then shared either AK or an EAM. Participants were randomly assigned to read either positive or negative AMs from the strangers. Findings suggest that people feel closer to those who share positive AMs in the same way they construct memories: starting with general and moving to specific.

  9. Self-defining memories, scripts, and the life story: narrative identity in personality and psychotherapy.

    PubMed

    Singer, Jefferson A; Blagov, Pavel; Berry, Meredith; Oost, Kathryn M

    2013-12-01

    An integrative model of narrative identity builds on a dual memory system that draws on episodic memory and a long-term self to generate autobiographical memories. Autobiographical memories related to critical goals in a lifetime period lead to life-story memories, which in turn become self-defining memories when linked to an individual's enduring concerns. Self-defining memories that share repetitive emotion-outcome sequences yield narrative scripts, abstracted templates that filter cognitive-affective processing. The life story is the individual's overarching narrative that provides unity and purpose over the life course. Healthy narrative identity combines memory specificity with adaptive meaning-making to achieve insight and well-being, as demonstrated through a literature review of personality and clinical research, as well as new findings from our own research program. A clinical case study drawing on this narrative identity model is also presented with implications for treatment and research. © 2012 Wiley Periodicals, Inc.

  10. Hybrid Memory Management for Parallel Execution of Prolog on Shared Memory Multiprocessors

    DTIC Science & Technology

    1990-06-01

    ... organizing data to increase locality. The stack structure exhibits greater locality than the heap structure. Tradeoff decisions can also be made on ... (University of California at Berkeley, Department of Electrical Engineering and Computer Sciences, Berkeley, CA 94720).

  11. A Wideband Fast Multipole Method for the two-dimensional complex Helmholtz equation

    NASA Astrophysics Data System (ADS)

    Cho, Min Hyung; Cai, Wei

    2010-12-01

    A Wideband Fast Multipole Method (FMM) for the 2D Helmholtz equation is presented. It can evaluate the interactions between N particles governed by the fundamental solution of the 2D complex Helmholtz equation in a fast manner for a wide range of complex wave numbers k, which was not easy with the original FMM due to the instability of the diagonalized conversion operator. This paper includes a description of the theoretical background, the FMM algorithm, the software structure, and some test runs.
    Program summary
    Program title: 2D-WFMM
    Catalogue identifier: AEHI_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEHI_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
    No. of lines in distributed program, including test data, etc.: 4636
    No. of bytes in distributed program, including test data, etc.: 82 582
    Distribution format: tar.gz
    Programming language: C
    Computer: Any
    Operating system: Any operating system with gcc version 4.2 or newer
    Has the code been vectorized or parallelized?: Multi-core processors with shared memory
    RAM: Depending on the number of particles N and the wave number k
    Classification: 4.8, 4.12
    External routines: OpenMP (http://openmp.org/wp/)
    Nature of problem: Evaluate the interaction between N particles governed by the fundamental solution of the 2D Helmholtz equation with complex k.
    Solution method: Multilevel Fast Multipole Algorithm in a hierarchical quad-tree structure with a cutoff level, combining the low-frequency and high-frequency methods.
    Running time: Depending on the number of particles N, the wave number k, and the number of CPU cores; CPU time increases as N log N.
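
    For small N, the interactions the FMM accelerates can be checked against direct O(N^2) summation of the 2D Helmholtz fundamental solution G(x, y) = (i/4) H0(1)(k|x - y|). The Python/SciPy sketch below is a generic validation aid, not part of the distributed C package.

        import numpy as np
        from scipy.special import hankel1

        def direct_sum(points, charges, k):
            # pairwise distances |x_i - x_j|
            d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
            np.fill_diagonal(d, 1.0)             # dummy distance, zeroed below
            G = 0.25j * hankel1(0, k * d)        # fundamental solution values
            np.fill_diagonal(G, 0.0)             # omit self-interaction
            return G @ charges

        rng = np.random.default_rng(0)
        pts = rng.random((200, 2))
        q = rng.random(200).astype(complex)
        phi = direct_sum(pts, q, k=5.0 + 0.1j)   # complex wave number, as in the paper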

  12. Unconditional security of entanglement-based continuous-variable quantum secret sharing

    NASA Astrophysics Data System (ADS)

    Kogias, Ioannis; Xiang, Yu; He, Qiongyi; Adesso, Gerardo

    2017-01-01

    The need for secrecy and security is essential in communication. Secret sharing is a conventional protocol to distribute a secret message to a group of parties, who cannot access it individually but need to cooperate in order to decode it. While several variants of this protocol have been investigated, including realizations using quantum systems, the security of quantum secret sharing schemes still remains unproven almost two decades after their original conception. Here we establish an unconditional security proof for entanglement-based continuous-variable quantum secret sharing schemes, in the limit of asymptotic keys and for an arbitrary number of players. We tackle the problem by resorting to the recently developed one-sided device-independent approach to quantum key distribution. We demonstrate theoretically the feasibility of our scheme, which can be implemented by Gaussian states and homodyne measurements, with no need for ideal single-photon sources or quantum memories. Our results contribute to validating quantum secret sharing as a viable primitive for quantum technologies.

  13. Toward Enhancing OpenMP's Work-Sharing Directives

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chapman, B M; Huang, L; Jin, H

    2006-05-17

    OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, including thread subteams and thread topologies. Thus, we identify language features that improve OpenMP application performance on emerging and large-scale platforms while preserving ease of programming.

  14. Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning

    NASA Technical Reports Server (NTRS)

    Das, Raja; Ponnusamy, Ravi; Saltz, Joel; Mavriplis, Dimitri

    1991-01-01

    Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods.
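
    The copy-reuse idea can be sketched as a small cache keyed by the accessed indices and invalidated when the owning data is written; the version counter below stands in for the compiler analysis that verifies off-processor values remain unmodified, and all names are invented.

        class GhostCache:
            def __init__(self, fetch):
                self.fetch = fetch      # callable: indices -> off-processor values
                self.version = 0        # bumped whenever the owned data changes
                self.cache = {}         # indices key -> (version, values)

            def gather(self, indices):
                key = tuple(indices)
                hit = self.cache.get(key)
                if hit is not None and hit[0] == self.version:
                    return hit[1]       # reuse stored copies: no communication
                vals = self.fetch(indices)
                self.cache[key] = (self.version, vals)
                return vals

            def invalidate(self):
                self.version += 1       # owned values were written; copies stale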

  15. Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems.

    PubMed

    Pan, Tony; Flick, Patrick; Jain, Chirag; Liu, Yongchao; Aluru, Srinivas

    2017-10-09

    Counting and indexing fixed-length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks, including genome alignment and mapping, genome assembly, and error correction. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the datasets at the current generation rate of 1.8 terabases every 3 days. We present Kmerind, a high performance parallel k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Kmerind's k-mer counter performs similarly to or better than the best existing k-mer counting tools, even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read dataset in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1% of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. Kmerind is the first k-mer indexing library for distributed memory environments, and the first extensible library for general k-mer indexing and counting. Kmerind is available at https://github.com/ParBLiSS/kmerind.
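
    The semantics that Kmerind parallelizes can be stated in a few lines: slide a window of length k over each read and count the substrings. Kmerind distributes exactly this hashed k-mer-to-count map across MPI ranks; the serial Python sketch below is for illustration only.

        from collections import Counter

        def count_kmers(reads, k):
            counts = Counter()
            for read in reads:
                for i in range(len(read) - k + 1):   # every window of length k
                    counts[read[i:i + k]] += 1
            return counts

        print(count_kmers(["GATTACA", "ATTAC"], 3))
        # Counter({'ATT': 2, 'TTA': 2, 'TAC': 2, 'GAT': 1, 'ACA': 1})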

  16. The structure of the clouds distributed operating system

    NASA Technical Reports Server (NTRS)

    Dasgupta, Partha; Leblanc, Richard J., Jr.

    1989-01-01

    A novel system architecture, based on the object model, is the central structuring concept used in the Clouds distributed operating system. This architecture makes Clouds attractive over a wide class of machines and environments. Clouds is a native operating system, designed and implemented at Georgia Tech, and runs on a set of general purpose computers connected via a local area network. The system architecture of Clouds is composed of a system-wide global set of persistent (long-lived) virtual address spaces, called objects, that contain persistent data and code. The object concept is implemented at the operating system level, thus presenting a single-level storage view to the user. Lightweight threads carry computational activity through the code stored in the objects. The persistent objects and threads give rise to a programming environment composed of shared permanent memory, dispensing with the need for hardware-derived concepts such as file systems and message systems. Though the hardware may be distributed and may have disks and networks, Clouds provides the applications with a logically centralized system, based on a shared, structured, single-level store. The current design of Clouds uses a minimalist philosophy with respect to both the kernel and the operating system: the kernel and the operating system support a bare minimum of functionality. Clouds also adheres to the concept of separation of policy and mechanism. Most low-level operating system services are implemented above the kernel, and most high-level services are implemented at the user level. From the measured performance of the kernel mechanisms, we are able to demonstrate that efficient implementations are feasible for the object model on commercially available hardware. Clouds provides a rich environment for conducting research in distributed systems. Some of the topics addressed in this paper include distributed programming environments, consistency of persistent data, and fault tolerance.

  17. Multithreaded transactions in scientific computing: New versions of a computer program for kinematical calculations of RHEED intensity oscillations

    NASA Astrophysics Data System (ADS)

    Brzuszek, Marcin; Daniluk, Andrzej

    2006-11-01

    Writing a concurrent program can be more difficult than writing a sequential program: the programmer needs to think about synchronisation, race conditions, and shared variables. Transactions help reduce the inconvenience of using threads. A transaction is an abstraction which allows programmers to group a sequence of actions on the program into a logical, higher-level computation unit. This paper presents multithreaded versions of the GROWTH program, which calculate the layer coverages during the growth of thin epitaxial films and the corresponding RHEED intensities according to the kinematical approximation. The presented programs also contain graphical user interfaces, which enable displaying program data at run-time.
    New version program summary
    Titles of programs: GROWTHGr, GROWTH06
    Catalogue identifier: ADVL_v2_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADVL_v2_0
    Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland
    Catalogue identifier of previous version: ADVL
    Does the new version supersede the original program?: No
    Computer for which the new version is designed and others on which it has been tested: Pentium-based PC
    Operating systems or monitors under which the new version has been tested: Windows 9x, XP, NT
    Programming language used: Object Pascal
    Memory required to execute with typical data: More than 1 MB
    Number of bits in a word: 64 bits
    Number of processors used: 1
    No. of lines in distributed program, including test data, etc.: 20 931
    Number of bytes in distributed program, including test data, etc.: 1 311 268
    Distribution format: tar.gz
    Nature of physical problem: The programs compute the RHEED intensities during the growth of thin epitaxial structures prepared using molecular beam epitaxy (MBE). The computations are based on the kinematical diffraction theory [P.I. Cohen, G.S. Petrich, P.R. Pukite, G.J. Whaley, A.S. Arrott, Surf. Sci. 216 (1989) 222. [1]

  18. A manual for PARTI runtime primitives

    NASA Technical Reports Server (NTRS)

    Berryman, Harry; Saltz, Joel

    1990-01-01

    Primitives are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equation solvers) on distributed memory machines. These primitives are also designed for use in compilers for distributed memory multiprocessors. Communication patterns are captured at runtime, and the appropriate send and receive messages are automatically generated.
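
    Runtime primitives of this kind follow what is usually called the inspector/executor pattern. The sketch below illustrates the idea in plain C, assuming a 1-D block ownership and using an array copy to stand in for the generated send/receive messages; the function names are illustrative, not PARTI's actual API.

        #include <stdio.h>
        #include <stdlib.h>

        typedef struct { int n; int *off_proc; } schedule_t;

        /* Inspector: scan the indirection array once at runtime and record
         * which references fall outside the locally owned block [lo, hi). */
        static schedule_t *inspect(const int *index, int n, int lo, int hi) {
            schedule_t *s = malloc(sizeof *s);
            s->off_proc = malloc(n * sizeof *s->off_proc);
            s->n = 0;
            for (int i = 0; i < n; ++i)
                if (index[i] < lo || index[i] >= hi)
                    s->off_proc[s->n++] = index[i];
            return s;
        }

        /* Executor: reuse the schedule on every sweep.  In a real system
         * this step would post the sends and receives the schedule
         * prescribes; here a plain copy stands in for the communication. */
        static void gather(double *ghost, const schedule_t *s, const double *global) {
            for (int i = 0; i < s->n; ++i)
                ghost[i] = global[s->off_proc[i]];
        }

        int main(void) {
            int idx[4] = {2, 7, 11, 3};              /* runtime indirection array */
            double global[16], ghost[4];
            for (int i = 0; i < 16; ++i) global[i] = i;

            schedule_t *s = inspect(idx, 4, 0, 8);   /* this process owns [0, 8) */
            gather(ghost, s, global);                /* fetches element 11 only */
            printf("off-process references: %d\n", s->n);
            return 0;
        }

    The payoff of the two-phase split is that the inspector's cost is amortized: the schedule is built once and the executor reuses it across many sweeps.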

  19. Concurrent Image Processing Executive (CIPE). Volume 1: Design overview

    NASA Technical Reports Server (NTRS)

    Lee, Meemong; Groom, Steven L.; Mazer, Alan S.; Williams, Winifred I.

    1990-01-01

    The design and implementation of a Concurrent Image Processing Executive (CIPE), which is intended to become the support system software for a prototype high performance science analysis workstation are described. The target machine for this software is a JPL/Caltech Mark 3fp Hypercube hosted by either a MASSCOMP 5600 or a Sun-3, Sun-4 workstation; however, the design will accommodate other concurrent machines of similar architecture, i.e., local memory, multiple-instruction-multiple-data (MIMD) machines. The CIPE system provides both a multimode user interface and an applications programmer interface, and has been designed around four loosely coupled modules: user interface, host-resident executive, hypercube-resident executive, and application functions. The loose coupling between modules allows modification of a particular module without significantly affecting the other modules in the system. In order to enhance hypercube memory utilization and to allow expansion of image processing capabilities, a specialized program management method, incremental loading, was devised. To minimize data transfer between host and hypercube, a data management method which distributes, redistributes, and tracks data set information was implemented. The data management also allows data sharing among application programs. The CIPE software architecture provides a flexible environment for scientific analysis of complex remote sensing image data, such as planetary data and imaging spectrometry, utilizing state-of-the-art concurrent computation capabilities.

  20. Execution models for mapping programs onto distributed memory parallel computers

    NASA Technical Reports Server (NTRS)

    Sussman, Alan

    1992-01-01

    The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. Moreover, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.

  1. Guess Who's Coming to Graduation?

    ERIC Educational Resources Information Center

    Miller, Ann

    2010-01-01

    In this article, the author shares her experience in assisting seniors and two teachers at Kalamazoo (Michigan) Central High School when they joined a national contest to have the President of the United States deliver the commencement address to the class of 2010. One of the author's favorite memories surrounds the distribution of commencement…

  2. Shared Semantics and the Use of Organizational Memories for E-Mail Communications.

    ERIC Educational Resources Information Center

    Schwartz, David G.

    1998-01-01

    Examines the use of shared semantics information to link concepts in an organizational memory to e-mail communications. Presents a framework for determining shared semantics based on organizational and personal user profiles. Illustrates how shared semantics are used by the HyperMail system to help link organizational memories (OM) content to…

  3. A class Hierarchical, object-oriented approach to virtual memory management

    NASA Technical Reports Server (NTRS)

    Russo, Vincent F.; Campbell, Roy H.; Johnston, Gary M.

    1989-01-01

    The Choices family of operating systems exploits class hierarchies and object-oriented programming to facilitate the construction of customized operating systems for shared memory and networked multiprocessors. The software is being used in the Tapestry laboratory to study the performance of algorithms, mechanisms, and policies for parallel systems. Described here are the architectural design and class hierarchy of the Choices virtual memory management system. The software and hardware mechanisms and policies of a virtual memory system implement a memory hierarchy that exploits the trade-off between response times and storage capacities. In Choices, the notion of a memory hierarchy is captured by abstract classes. Concrete subclasses of those abstractions implement a virtual address space, segmentation, paging, physical memory management, secondary storage, and remote (that is, networked) storage. Captured in the notion of a memory hierarchy are classes that represent memory objects. These classes provide a storage mechanism that contains encapsulated data and have methods to read or write the memory object. Each of these classes provides specializations to represent the memory hierarchy.
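
    In C terms (Choices itself is written in C++), the abstract-class idea can be rendered as a struct of function pointers that concrete memory objects fill in. The sketch below is a loose illustration of that pattern, not code from Choices.

        #include <stdio.h>
        #include <string.h>

        /* "Abstract class": a memory object exposes read/write methods and
         * encapsulates its own data; concrete kinds (physical memory,
         * paging store, remote storage) would supply their own behavior. */
        typedef struct memory_object {
            int (*read) (struct memory_object *self, long off, void *buf, int n);
            int (*write)(struct memory_object *self, long off, const void *buf, int n);
            char backing[4096];                  /* encapsulated data */
        } memory_object;

        /* One concrete "subclass": plain in-core memory. */
        static int mem_read(memory_object *self, long off, void *buf, int n) {
            memcpy(buf, self->backing + off, n); return n;
        }
        static int mem_write(memory_object *self, long off, const void *buf, int n) {
            memcpy(self->backing + off, buf, n); return n;
        }

        int main(void) {
            memory_object phys = { mem_read, mem_write, {0} };
            phys.write(&phys, 0, "page", 5);     /* includes the terminator */
            char buf[5];
            phys.read(&phys, 0, buf, 5);
            printf("%s\n", buf);
            return 0;
        }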

  4. C-MOS array design techniques: SUMC multiprocessor system study

    NASA Technical Reports Server (NTRS)

    Clapp, W. A.; Helbig, W. A.; Merriam, A. S.

    1972-01-01

    The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four I/O processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through I/O processors, and multiport access to multiple and shared memory units.

  5. WWWinda Orchestrator: a mechanism for coordinating distributed flocks of Java Applets

    NASA Astrophysics Data System (ADS)

    Gutfreund, Yechezkal-Shimon; Nicol, John R.

    1997-01-01

    The WWWinda Orchestrator is a simple but powerful tool for coordinating distributed Java applets. Loosely derived from the Linda programming language developed by David Gelernter and Nicholas Carriero of Yale, WWWinda implements a distributed shared object space, called TupleSpace, where applets can post, read, or permanently store arbitrary Java objects. In this manner, applets can easily share information without being aware of the underlying communication mechanisms. WWWinda is very useful for orchestrating flocks of distributed Java applets: coordination events can be posted to the WWWinda TupleSpace and used to orchestrate the actions of remote applets, and applets can easily share information via the TupleSpace. The technology combines several functions in one simple metaphor: distributed web objects, remote messaging between applets, distributed synchronization mechanisms, an object-oriented database, and a distributed event signaling mechanism. WWWinda can be used as a platform for implementing shared VRML environments, shared groupware environments, controlling remote devices such as cameras, distributed Karaoke, distributed gaming, and shared audio and video experiences.
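
    The tuple-space style of coordination can be sketched in plain C, loosely following Linda's out/rd operations. This is not the WWWinda API, and the key/value tuples below are a simplification of the arbitrary Java objects the system actually stores.

        #include <stdio.h>
        #include <string.h>

        #define MAX 64
        static struct { char key[32]; int value; int used; } space[MAX];

        /* out: post a tuple into the shared space. */
        static void ts_out(const char *key, int value) {
            for (int i = 0; i < MAX; ++i)
                if (!space[i].used) {
                    strncpy(space[i].key, key, sizeof space[i].key - 1);
                    space[i].value = value;
                    space[i].used = 1;
                    return;
                }
        }

        /* rd: non-destructive read by key; returns 1 if a tuple matched. */
        static int ts_rd(const char *key, int *value) {
            for (int i = 0; i < MAX; ++i)
                if (space[i].used && strcmp(space[i].key, key) == 0) {
                    *value = space[i].value;
                    return 1;
                }
            return 0;
        }

        int main(void) {
            ts_out("camera/pan", 30);        /* one applet posts a coordination event */
            int pan;
            if (ts_rd("camera/pan", &pan))   /* another applet picks it up */
                printf("pan to %d degrees\n", pan);
            return 0;
        }

    The point of the metaphor is visible even at this scale: the producer and consumer never name each other, only the tuple.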

  6. A compositional reservoir simulator on distributed memory parallel computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rame, M.; Delshad, M.

    1995-12-31

    This paper presents the application of distributed memory parallel computers to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose, highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/860 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes porting to new parallel platforms straightforward. Results on the distributed memory computing performance of the parallel simulator are presented for field scale applications such as tracer floods and polymer floods. A comparison with the wall-clock times for the same problems on a vector supercomputer is also presented.
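
    The subdomain extension described here is commonly realized as a ghost-cell (halo) exchange. The MPI sketch below shows the idea for a 1-D decomposition; it is a generic illustration, not UTCHEM code.

        #include <mpi.h>

        #define N 100                      /* local cells, plus 2 ghost cells */

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            double u[N + 2];
            for (int i = 1; i <= N; ++i) u[i] = rank;   /* interior values */

            int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
            int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

            /* Send my first interior cell left while receiving the right
             * neighbor's boundary into my right ghost cell, and vice versa,
             * so the stencil can then be computed entirely locally. */
            MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                         &u[N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N],     1, MPI_DOUBLE, right, 1,
                         &u[0],     1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            MPI_Finalize();
            return 0;
        }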

  7. Strategies for Energy Efficient Resource Management of Hybrid Programming Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, Dong; Supinski, Bronis de; Schulz, Martin

    2013-01-01

    Many scientific applications are programmed using hybrid programming models that use both message-passing and shared-memory, due to the increasing prevalence of large-scale systems with multicore, multisocket nodes. Previous work has shown that energy efficiency can be improved using software-controlled execution schemes that consider both the programming model and the power-aware execution capabilities of the system. However, such approaches have focused on identifying optimal resource utilization for one programming model, either shared-memory or message-passing, in isolation. The potential solution space, thus the challenge, increases substantially when optimizing hybrid models since the possible resource configurations increase exponentially. Nonetheless, with the accelerating adoption of hybrid programming models, we increasingly need improved energy efficiency in hybrid parallel applications on large-scale systems. In this work, we present new software-controlled execution schemes that consider the effects of dynamic concurrency throttling (DCT) and dynamic voltage and frequency scaling (DVFS) in the context of hybrid programming models. Specifically, we present predictive models and novel algorithms based on statistical analysis that anticipate application power and time requirements under different concurrency and frequency configurations. We apply our models and methods to the NPB MZ benchmarks and selected applications from the ASC Sequoia codes. Overall, we achieve substantial energy savings (8.74% on average and up to 13.8%) with some performance gain (up to 7.5%) or negligible performance loss.

  8. High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Peterka, Tom; Morozov, Dmitriy; Phillips, Carolyn

    2014-11-14

    Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization; but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared-memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.

  9. A manual for PARTI runtime primitives, revision 1

    NASA Technical Reports Server (NTRS)

    Das, Raja; Saltz, Joel; Berryman, Harry

    1991-01-01

    Primitives are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equation solvers) on distributed memory machines. These primitives are also designed for use in compilers for distributed memory multiprocessors. Communication patterns are captured at runtime, and the appropriate send and receive messages are automatically generated.

  10. What we remember affects how we see: spatial working memory steers saccade programming.

    PubMed

    Wong, Jason H; Peterson, Matthew S

    2013-02-01

    Relationships between visual attention, saccade programming, and visual working memory have been hypothesized for over a decade. Awh, Jonides, and Reuter-Lorenz (Journal of Experimental Psychology: Human Perception and Performance 24(3):780-90, 1998) and Awh et al. (Psychological Science 10(5):433-437, 1999) proposed that rehearsing a location in memory also leads to enhanced attentional processing at that location. In regard to eye movements, Belopolsky and Theeuwes (Attention, Perception & Psychophysics 71(3):620-631, 2009) found that holding a location in working memory affects saccade programming, albeit negatively. In three experiments, we attempted to replicate the findings of Belopolsky and Theeuwes (Attention, Perception & Psychophysics 71(3):620-631, 2009) and determine whether the spatial memory effect can occur in other saccade-cuing paradigms, including endogenous central arrow cues and exogenous irrelevant singletons. In the first experiment, our results were the opposite of those in Belopolsky and Theeuwes (Attention, Perception & Psychophysics 71(3):620-631, 2009), in that we found facilitation (shorter saccade latencies) instead of inhibition when the saccade target matched the region in spatial working memory. In Experiment 2, we sought to determine whether the spatial working memory effect would generalize to other endogenous cuing tasks, such as a central arrow that pointed to one of six possible peripheral locations. As in Experiment 1, we found that saccade programming was facilitated when the cued location coincided with the saccade target. In Experiment 3, we explored how spatial memory interacts with other types of cues, such as a peripheral color singleton target or irrelevant onset. In both cases, the eyes were more likely to go to either singleton when it coincided with the location held in spatial working memory. On the basis of these results, we conclude that spatial working memory and saccade programming are likely to share common overlapping circuitry.

  11. Hierarchical Process Composition: Dynamic Maintenance of Structure in a Distributed Environment

    DTIC Science & Technology

    1988-01-01

    One prominent line of research stresses the independence of address space and thread of control, and the resulting efficiencies due to shared memory...cooperating processes. StarOS focuses on ease of use and a general capability mechanism, while Medusa stresses the effect of distributed hardware on system...process structure and the asynchrony among agents and between agents and sources of failure. By stressing dynamic structure, we are led to adopt an

  12. The OpenMP Implementation of NAS Parallel Benchmarks and its Performance

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry

    1999-01-01

    As the new ccNUMA architecture became popular in recent years, parallel programming with compiler directives on these machines has evolved to accommodate new needs. In this study, we examine the effectiveness of OpenMP directives for parallelizing the NAS Parallel Benchmarks. Implementation details will be discussed and performance will be compared with the MPI implementation. We have demonstrated that OpenMP can achieve very good results for parallelization on a shared memory system, but effective use of memory and cache is very important.
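
    For readers unfamiliar with the directive style examined here, a minimal OpenMP example in C: a single pragma parallelizes the loop and its reduction on a shared memory system, with no explicit data movement.

        #include <omp.h>
        #include <stdio.h>

        int main(void) {
            enum { N = 1000000 };
            static double a[N], b[N];
            double sum = 0.0;

            /* One directive distributes iterations across threads; the
             * reduction clause gives each thread a private accumulator. */
            #pragma omp parallel for reduction(+:sum) schedule(static)
            for (int i = 0; i < N; ++i) {
                a[i] = 0.5 * i;
                b[i] = 2.0 * i;
                sum += a[i] * b[i];
            }
            printf("dot = %e (max threads: %d)\n", sum, omp_get_max_threads());
            return 0;
        }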

  13. MaMR: High-performance MapReduce programming model for material cloud applications

    NASA Astrophysics Data System (ADS)

    Jing, Weipeng; Tong, Danyu; Wang, Yangang; Wang, Jingyuan; Liu, Yaqiu; Zhao, Peng

    2017-02-01

    With the increasing data size in materials science, existing programming models no longer satisfy the application requirements. MapReduce is a programming model that enables the easy development of scalable parallel applications to process big data on cloud computing systems. However, this model does not directly support the processing of multiple related data, and its processing performance does not reflect the advantages of cloud computing. To enhance the capability of workflow applications in material data processing, we defined a programming model for material cloud applications, called MaMR, that supports multiple different Map and Reduce functions running concurrently on top of a hybrid shared-memory BSP model. An optimized data sharing strategy to supply the shared data to the different Map and Reduce stages was also designed. We added a new merge phase to MapReduce that can efficiently merge data from the map and reduce modules. Experiments showed that the model and framework achieve effective performance improvements compared to previous work.

  14. PELEC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2017-05-17

    PeleC is an adaptive-mesh compressible hydrodynamics code for reacting flows. It solves the compressible Navier-Stokes equations with multispecies transport in a block-structured framework. The resulting algorithm is well suited for flows with localized resolution requirements and is robust to discontinuities. User-controllable refinement criteria have the potential to result in extremely small numerical dissipation and dispersion, making this code appropriate for both research and applied usage. The code is built on the AMReX library, which facilitates hierarchical parallelism and manages distributed memory parallelism. PeleC algorithms are implemented to express shared memory parallelism.

  15. Tolerant (parallel) Programming

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Bailey, David H. (Technical Monitor)

    1997-01-01

    In order to be truly portable, a program must be tolerant of a wide range of development and execution environments, and a parallel program is just one which must be tolerant of a very wide range. This paper first defines the term "tolerant programming", then describes many layers of tools to accomplish it. The primary focus is on F-Nets, a formal model for expressing computation as a folded partial-ordering of operations, thereby providing an architecture-independent expression of tolerant parallel algorithms. F-Nets are implemented on top of Cooperative Data Sharing (CDS), a subroutine package that implements communication efficiently in a large number of environments (e.g., shared memory and message passing). Software Cabling (SC), a very-high-level graphical programming language for building large F-Nets, possesses many of the features normally expected from today's computer languages (e.g., data abstraction, array operations). Finally, L2^3 is a CASE tool which facilitates the construction, compilation, execution, and debugging of SC programs.

  16. Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing

    NASA Technical Reports Server (NTRS)

    Fricker, David M.

    1997-01-01

    The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.

  17. Holding on

    ERIC Educational Resources Information Center

    Thaxton, Terry Ann

    2011-01-01

    In this article, the author takes a multidimensional and personal look at creative writing work in an assisted living facility. The people she works with at the facility have memory loss. She shares her experience working with these people and describes a storytelling workshop that was modeled after Timeslips, a program started by Anne Basting at…

  18. The Global File System

    NASA Technical Reports Server (NTRS)

    Soltis, Steven R.; Ruwart, Thomas M.; O'Keefe, Matthew T.

    1996-01-01

    The global file system (GFS) is a prototype design for a distributed file system in which cluster nodes physically share storage devices connected via a network such as Fibre Channel. Networks and network-attached storage devices have advanced to a level of performance and extensibility such that the previous disadvantages of shared disk architectures are no longer valid. This shared storage architecture attempts to exploit the sophistication of storage device technologies, whereas a server architecture diminishes a device's role to that of a simple component. GFS distributes the file system responsibilities across the processing nodes, storage across the devices, and file system resources across the entire storage pool. GFS caches data on the storage devices instead of in the main memories of the machines. Consistency is established by using a locking mechanism maintained by the storage devices to facilitate atomic read-modify-write operations. The locking mechanism is being prototyped in the Silicon Graphics IRIX operating system and is accessed using standard Unix commands and modules.
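
    A conceptual C sketch of the device-maintained locking described above: dev_lock/dev_unlock are purely illustrative stand-ins for GFS's device lock primitives, and the "device" state here is just a host variable, so this shows only the shape of the protocol, not its implementation.

        #include <stdio.h>

        static int device_lock_word = 0;   /* in GFS this state lives on the device */

        static int dev_lock(int node_id) {
            if (device_lock_word) return 0;       /* device refuses: lock held */
            device_lock_word = node_id;           /* device grants the lock */
            return 1;
        }
        static void dev_unlock(void) { device_lock_word = 0; }

        /* Atomic read-modify-write of shared on-disk metadata: no node may
         * observe or overwrite the value between the read and the write. */
        static void bump_counter(int node_id, long *on_disk_counter) {
            while (!dev_lock(node_id))
                ;                                 /* retry until granted */
            long v = *on_disk_counter;            /* read   */
            *on_disk_counter = v + 1;             /* modify + write back */
            dev_unlock();                         /* release */
        }

        int main(void) {
            long counter = 0;
            bump_counter(1, &counter);
            printf("counter = %ld\n", counter);
            return 0;
        }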

  19. Message Passing and Shared Address Space Parallelism on an SMP Cluster

    NASA Technical Reports Server (NTRS)

    Shan, Hongzhang; Singh, Jaswinder P.; Oliker, Leonid; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2002-01-01

    Currently, message passing (MP) and shared address space (SAS) are the two leading parallel programming paradigms. MP has been standardized with MPI, and is the more common and mature approach; however, code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of and the programming effort required for six applications under both programming models on a 32-processor PC-SMP cluster, a platform that is becoming increasingly attractive for high-end scientific computing. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and/or complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications, while being competitive for the others. A hybrid MPI+SAS strategy shows only a small performance advantage over pure MPI in some cases. Finally, improved implementations of two MPI collective operations on PC-SMP clusters are presented.

  20. High-performance computing — an overview

    NASA Astrophysics Data System (ADS)

    Marksteiner, Peter

    1996-08-01

    An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.

  1. Force user's manual, revised

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.

    1987-01-01

    A methodology for writing parallel programs for shared memory multiprocessors has been formalized as an extension to the Fortran language and implemented as a macro preprocessor. The extended language is known as the Force, and this manual describes how to write Force programs and execute them on the Flexible Computer Corporation Flex/32, the Encore Multimax and the Sequent Balance computers. The parallel extension macros are described in detail, but knowledge of Fortran is assumed.

  2. Aho-Corasick String Matching on Shared and Distributed Memory Parallel Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel

    String matching is at the core of many critical applications, including network intrusion detection systems, search engines, virus scanners, spam filters, DNA and protein sequencing, and data mining. For all of these applications string matching requires a combination of (sometimes all of) the following characteristics: high and/or predictable performance, support for large data sets and flexibility of integration and customization. Many software based implementations targeting conventional cache-based microprocessors fail to achieve high and predictable performance requirements, while Field-Programmable Gate Array (FPGA) implementations and dedicated hardware solutions fail to support large data sets (dictionary sizes) and are difficult to integrate and customize. The advent of multicore, multithreaded, and GPU-based systems is opening the possibility for software based solutions to reach very high performance at a sustained rate. This paper compares several software-based implementations of the Aho-Corasick string searching algorithm for high performance systems. We discuss the implementation of the algorithm on several types of shared-memory high-performance architectures (Niagara 2, large x86 SMPs and Cray XMT), distributed memory with homogeneous processing elements (InfiniBand cluster of x86 multicores) and heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C1060 GPUs). We describe in detail how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.
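
    For reference, a compact serial Aho-Corasick in C: trie construction, breadth-first failure links with goto-function compression, and the scanning loop. This is the textbook algorithm the paper parallelizes, not any of the paper's implementations, and for brevity it assumes a lowercase alphabet and at most 32 patterns (a bitmask of matches per state).

        #include <stdio.h>

        #define MAXN  512     /* node pool */
        #define ALPHA 26      /* 'a'..'z' only, an assumption for brevity */

        static int child[MAXN][ALPHA], fail[MAXN], out[MAXN];
        static int nnodes = 1;                     /* node 0 is the root */

        static void ac_add(const char *pat, int id) {
            int v = 0;
            for (; *pat; ++pat) {
                int c = *pat - 'a';
                if (!child[v][c]) child[v][c] = nnodes++;
                v = child[v][c];
            }
            out[v] |= 1 << id;                     /* pattern id ends here */
        }

        static void ac_build(void) {               /* BFS over the trie */
            int queue[MAXN], head = 0, tail = 0;
            for (int c = 0; c < ALPHA; ++c)
                if (child[0][c]) queue[tail++] = child[0][c];
            while (head < tail) {
                int v = queue[head++];
                out[v] |= out[fail[v]];            /* inherit suffix matches */
                for (int c = 0; c < ALPHA; ++c) {
                    int u = child[v][c];
                    if (u) { fail[u] = child[fail[v]][c]; queue[tail++] = u; }
                    else child[v][c] = child[fail[v]][c];  /* compress goto */
                }
            }
        }

        static void ac_search(const char *text, const char **pats) {
            int v = 0;
            for (int i = 0; text[i]; ++i) {
                v = child[v][text[i] - 'a'];       /* one transition per char */
                for (int m = out[v], id = 0; m; m >>= 1, ++id)
                    if (m & 1) printf("'%s' ends at index %d\n", pats[id], i);
            }
        }

        int main(void) {
            const char *pats[] = { "he", "she", "his", "hers" };
            for (int i = 0; i < 4; ++i) ac_add(pats[i], i);
            ac_build();
            ac_search("ushers", pats);             /* he@3, she@3, hers@5 */
            return 0;
        }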

  3. Information Regarding the Effect of Applying the Representative Tax System to the General Revenue Sharing, Medicaid, and Vocational Education Programs.

    ERIC Educational Resources Information Center

    General Accounting Office, Washington, DC.

    The General Accounting Office assessed the likely impact of replacing personal income with the Representative Tax System (RTS) on the distribution of federal aid among the states in three formula-based programs. These programs were the General Fiscal Assistance Act of 1972, known as the Revenue Sharing program; Title XIX of the Social Security…

  4. Array distribution in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

    1994-01-01

    We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.

  5. Accelerate quasi Monte Carlo method for solving systems of linear algebraic equations through shared memory

    NASA Astrophysics Data System (ADS)

    Lai, Siyan; Xu, Ying; Shao, Bo; Guo, Menghan; Lin, Xiaola

    2017-04-01

    In this paper we study the Monte Carlo method for solving systems of linear algebraic equations (SLAE) using shared memory. Prior research demonstrated that GPUs can effectively speed up the computations for this problem. Our purpose is to optimize the Monte Carlo simulation specifically for the GPU memory architecture. Random numbers are organized and stored in shared memory, which accelerates the parallel algorithm. Bank conflicts can be avoided by our Collaborative Thread Arrays (CTA) scheme. The results of experiments show that the shared memory based strategy can speed up the computations by up to 3X.

  6. Monitoring Data-Structure Evolution in Distributed Message-Passing Programs

    NASA Technical Reports Server (NTRS)

    Sarukkai, Sekhar R.; Beers, Andrew; Woodrow, Thomas S. (Technical Monitor)

    1996-01-01

    Monitoring the evolution of data structures in parallel and distributed programs is critical for debugging their semantics and performance. However, the current state of the art in tracking and presenting data-structure information on parallel and distributed environments is cumbersome and does not scale. In this paper we present a methodology that automatically tracks memory bindings (not the actual contents) of static and dynamic data structures of message-passing C programs using PVM. With the help of a number of examples we show that, in addition to determining the impact of memory allocation overheads on program performance, graphical views can help in debugging the semantics of program execution. Scalable animations of virtual address bindings of source-level data structures are used for debugging the semantics of parallel programs across all processors. In conjunction with light-weight core files, this technique can be used to complement traditional debuggers on single processors. Detailed information (such as data-structure contents) on specific nodes can be determined using traditional debuggers after the data-structure evolution leading to the semantic error is observed graphically.

  7. Practical Formal Verification of MPI and Thread Programs

    NASA Astrophysics Data System (ADS)

    Gopalakrishnan, Ganesh; Kirby, Robert M.

    Large-scale simulation codes in science and engineering are written using the Message Passing Interface (MPI). Shared memory threads are widely used directly, or to implement higher level programming abstractions. Traditional debugging methods for MPI or thread programs are incapable of providing useful formal guarantees about coverage. They get bogged down in the sheer number of interleavings (schedules), often missing shallow bugs. In this tutorial we will introduce two practical formal verification tools: ISP (for MPI C programs) and Inspect (for Pthread C programs). Unlike other formal verification tools, ISP and Inspect run directly on user source codes (much like a debugger). They pursue only the relevant set of process interleavings, using our own customized Dynamic Partial Order Reduction algorithms. For a given test harness, DPOR allows these tools to guarantee the absence of deadlocks, instrumented MPI object leaks and communication races (using ISP), and shared memory races (using Inspect). ISP and Inspect have been used to verify large pieces of code: in excess of 10,000 lines of MPI/C for ISP in under 5 seconds, and about 5,000 lines of Pthread/C code in a few hours (and much faster with the use of a cluster or by exploiting special cases such as symmetry) for Inspect. We will also demonstrate the Microsoft Visual Studio and Eclipse Parallel Tools Platform integrations of ISP (these will be available on the LiveCD).

  8. CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU

    PubMed Central

    Ma, Jianliang; Meng, Jinglei; Chen, Tianzhou; Wu, Minghui

    2015-01-01

    Ultra-high thread-level parallelism in modern GPUs usually introduces numerous memory requests simultaneously, so there are always plenty of memory requests waiting at each bank of the shared LLC (L2 in this paper) and global memory. For global memory, various schedulers have already been developed to adjust the request sequence, but we find little work has focused on the service sequence at the shared LLC. We measured that in a large number of GPU applications memory requests queue at the LLC banks for service, which provides an opportunity to optimize the service order at the LLC. By adjusting the GPU memory request service order, we can improve the schedulability of the SMs. We therefore propose a critical-aware shared LLC request scheduling algorithm (CaLRS) in this paper. The priority assigned to each memory request is central to CaLRS: we use the number of memory requests that originate from the same warp but have not yet been serviced when they arrive at the shared LLC bank to represent the criticality of each warp. Experiments show that the proposed scheme can boost SM schedulability effectively by promoting the scheduling priority of memory requests with high criticality, and thereby indirectly improves the performance of the GPU. PMID:25729772

  9. Parallelization of Program to Optimize Simulated Trajectories (POST3D)

    NASA Technical Reports Server (NTRS)

    Hammond, Dana P.; Korte, John J. (Technical Monitor)

    2001-01-01

    This paper describes the parallelization of the Program to Optimize Simulated Trajectories (POST3D). POST3D uses a gradient-based optimization algorithm that reaches an optimum design point by moving from one design point to the next. The gradient calculations required to complete the optimization process dominate the computational time and have been parallelized using a Single Program Multiple Data (SPMD) approach on a distributed memory NUMA (non-uniform memory access) architecture. The Origin2000 was used for the tests presented.
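
    Gradient parallelization of this kind can be sketched with MPI in C: each rank evaluates finite-difference perturbations for its share of the design variables, and the pieces are summed onto one rank. The toy objective f() below merely stands in for POST3D's trajectory simulation; the round-robin distribution is one plausible SPMD layout, not necessarily the one the paper used.

        #include <mpi.h>
        #include <stdio.h>

        #define NDV 8                              /* number of design variables */

        static double f(const double *x) {         /* toy objective function */
            double s = 0.0;
            for (int i = 0; i < NDV; ++i) s += x[i] * x[i];
            return s;
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            double x[NDV] = {1, 2, 3, 4, 5, 6, 7, 8}, h = 1e-6;
            double f0 = f(x), grad[NDV] = {0}, local[NDV] = {0};

            /* One-sided finite differences; variables dealt out round-robin
             * so each rank perturbs a disjoint subset. */
            for (int i = rank; i < NDV; i += size) {
                double xi = x[i];
                x[i] = xi + h;
                local[i] = (f(x) - f0) / h;
                x[i] = xi;                         /* restore the design point */
            }
            MPI_Reduce(local, grad, NDV, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0) printf("df/dx0 = %f\n", grad[0]);
            MPI_Finalize();
            return 0;
        }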

  10. Learning science as a potential new source of understanding and improvement for continuing education and continuing professional development.

    PubMed

    Van Hoof, Thomas J; Doyle, Terrence J

    2018-01-15

    Learning science is an emerging interdisciplinary field that offers educators key insights about what happens in the brain when learning occurs. In addition to explanations about the learning process, which includes memory and involves different parts of the brain, learning science offers effective strategies to inform the planning and implementation of activities and programs in continuing education and continuing professional development. This article provides a brief description of learning, including the three key steps of encoding, consolidation and retrieval. The article also introduces four major learning-science strategies, known as distributed learning, retrieval practice, interleaving, and elaboration, which share the importance of considerable practice. Finally, the article describes how learning science aligns with the general findings from the most recent synthesis of systematic reviews about the effectiveness of continuing medical education.

  11. Efficient partitioning and assignment on programs for multiprocessor execution

    NASA Technical Reports Server (NTRS)

    Standley, Hilda M.

    1993-01-01

    The general problem studied is that of segmenting or partitioning programs for distribution across a multiprocessor system. Efficient partitioning and the assignment of program elements are of great importance, since the time consumed in this overhead activity may easily dominate the computation, effectively eliminating any gains made by the use of parallelism. In this study, the partitioning of sequentially structured programs (written in FORTRAN) is evaluated. Heuristics developed for similar applications are examined. Finally, a model for queueing networks with finite queues is developed which may be used to analyze multiprocessor system architectures with a shared memory approach to the problem of partitioning. The properties of sequentially written programs form obstacles to large scale (at the procedure or subroutine level) parallelization. Data dependencies of even the minutest nature, reflecting the sequential development of the program, severely limit parallelism. The design of heuristic algorithms is tied to the experience gained in the parallel splitting. Parallelism obtained through the physical separation of data has seen some success, especially at the data element level. Data parallelism on a grander scale requires models that accurately reflect the effects of blocking caused by finite queues. A model for the approximation of the performance of finite queueing networks is developed. This model makes use of the decomposition approach combined with the efficiency of product form solutions.

  12. Time Constraints and Resource Sharing in Adults' Working Memory Spans

    ERIC Educational Resources Information Center

    Barrouillet, Pierre; Bernardin, Sophie; Camos, Valerie

    2004-01-01

    This article presents a new model that accounts for working memory spans in adults, the time-based resource-sharing model. The model assumes that both components (i.e., processing and maintenance) of the main working memory tasks require attention and that memory traces decay as soon as attention is switched away. Because memory retrievals are…

  13. Channel doping concentration and cell program state dependence on random telegraph noise spatial and statistical distribution in 30 nm NAND flash memory

    NASA Astrophysics Data System (ADS)

    Tomita, Toshihiro; Miyaji, Kousuke

    2015-04-01

    The dependence of spatial and statistical distribution of random telegraph noise (RTN) in a 30 nm NAND flash memory on channel doping concentration NA and cell program state Vth is comprehensively investigated using three-dimensional Monte Carlo device simulation considering random dopant fluctuation (RDF). It is found that single trap RTN amplitude ΔVth is larger at the center of the channel region in the NAND flash memory, which is closer to the jellium (uniform) doping results since NA is relatively low to suppress junction leakage current. In addition, ΔVth peak at the center of the channel decreases in the higher Vth state due to the current concentration at the shallow trench isolation (STI) edges induced by the high vertical electrical field through the fringing capacitance between the channel and control gate. In such cases, ΔVth distribution slope λ cannot be determined by only considering RDF and single trap.

  14. Multi-processor including data flow accelerator module

    DOEpatents

    Davidson, George S.; Pierce, Paul E.

    1990-01-01

    An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
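
    The firing rule described in this patent abstract can be sketched compactly: tag bits track operand arrival, and a cell is queued for execution once every bit is set. Names and layout below are illustrative, not taken from the patent.

        #include <stdio.h>

        #define SLOTS 2

        typedef struct {
            double operands[SLOTS];
            unsigned tags;              /* bit i set => operand i has arrived */
            int primitive;              /* index into primitive lookup table */
        } cell_t;

        static cell_t ready_queue[16];
        static int nready = 0;

        /* Deliver one operand to a cell; schedule the cell for execution
         * when all of its tag bits are set. */
        static void deliver(cell_t *c, int slot, double value) {
            c->operands[slot] = value;
            c->tags |= 1u << slot;
            if (c->tags == (1u << SLOTS) - 1) {   /* all operands available */
                ready_queue[nready++] = *c;
                c->tags = 0;                      /* reset for the next firing */
            }
        }

        int main(void) {
            cell_t add = { {0, 0}, 0, /* hypothetical ADD primitive */ 0 };
            deliver(&add, 0, 1.5);    /* result distributed from one cell */
            deliver(&add, 1, 2.5);    /* ...and from another: fires now */
            printf("ready cells: %d\n", nready);
            return 0;
        }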

  15. Externalising the autobiographical self: sharing personal memories online facilitated memory retention.

    PubMed

    Wang, Qi; Lee, Dasom; Hou, Yubo

    2017-07-01

    Internet technology provides a new means of recalling and sharing personal memories in the digital age. What is the mnemonic consequence of posting personal memories online? Theories of transactive memory and autobiographical memory would make contrasting predictions. In the present study, college students completed a daily diary for a week, listing at the end of each day all the events that happened to them on that day. They also reported whether they posted any of the events online. Participants received a surprise memory test after the completion of the diary recording and then another test a week later. At both tests, events posted online were significantly more likely than those not posted online to be recalled. It appears that sharing memories online may provide unique opportunities for rehearsal and meaning-making that facilitate memory retention.

  16. Using process groups to implement failure detection in asynchronous environments

    NASA Technical Reports Server (NTRS)

    Ricciardi, Aleta M.; Birman, Kenneth P.

    1991-01-01

    Agreement on the membership of a group of processes in a distributed system is a basic problem that arises in a wide range of applications. Such groups occur when a set of processes cooperate to perform some task, share memory, monitor one another, subdivide a computation, and so forth. The group membership problem is discussed as it relates to failure detection in asynchronous, distributed systems. A rigorous, formal specification for group membership is presented under this interpretation. A solution is then presented for this problem.

  17. A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics

    NASA Astrophysics Data System (ADS)

    Bard, Christopher M.; Dorelli, John C.

    2014-02-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 1024^2 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.

  18. Study Trapped Charge Distribution in P-Channel Silicon-Oxide-Nitride-Oxide-Silicon Memory Device Using Dynamic Programming Scheme

    NASA Astrophysics Data System (ADS)

    Li, Fu-Hai; Chiu, Yung-Yueh; Lee, Yen-Hui; Chang, Ru-Wei; Yang, Bo-Jun; Sun, Wein-Town; Lee, Eric; Kuo, Chao-Wei; Shirota, Riichiro

    2013-04-01

    In this study, we precisely investigate the charge distribution in the SiN layer under dynamic programming by channel hot hole induced hot electron injection (CHHIHE) in a p-channel silicon-oxide-nitride-oxide-silicon (SONOS) memory device. In the dynamic programming scheme, the gate voltage is increased as a staircase with a fixed step amplitude, which prohibits the injection of holes into the SiN layer. A three-dimensional device simulation is calibrated and compared against the measured programming characteristics. It is found, for the first time, that the hot electron injection point quickly traverses from the drain to the source side, synchronized with the expansion of the charged area in the SiN layer. As a result, the injected charges quickly spread almost uniformly over the whole channel area during a short programming period, which affords a large tolerance against lateral trapped-charge diffusion during baking.

  19. Early Experiences Writing Performance Portable OpenMP 4 Codes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joubert, Wayne; Hernandez, Oscar R

    In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared memory approach and the newer accelerator targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and examine the expressiveness and performance implications of using these approaches. For example, we want to understand if the target map directives in OpenMP 4 improve data locality when mapped to a shared memory system, as opposed to the traditional first touch policy approach in traditional OpenMP. To that end, we use recent Cray and Intel compilers to measure the performance variations of a simple application kernel when executed on the OLCF's Titan supercomputer with NVIDIA GPUs and the Beacon system with Intel Xeon Phi accelerators attached. To better understand these trade-offs, we compare our results from traditional OpenMP shared memory implementations to the newer accelerator programming model when it is used to target both the CPU and an attached heterogeneous device. We believe the results and lessons learned as presented in this paper will be useful to the larger user community by providing guidelines that can assist programmers in the development of performance portable code.
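
    A minimal C example of the OpenMP 4 directives under discussion, assuming a compiler and attached device that support target offload: the map clauses control data movement between host and device memory, which is exactly the locality question the paper examines.

        #include <stdio.h>

        int main(void) {
            enum { N = 1 << 20 };
            static float x[N], y[N];
            for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

            /* Copy x to the device, run the loop there in parallel, and
             * copy y back; without a device, the region runs on the host. */
            #pragma omp target map(to: x[0:N]) map(tofrom: y[0:N])
            #pragma omp parallel for
            for (int i = 0; i < N; ++i)
                y[i] += 2.0f * x[i];

            printf("y[0] = %f\n", y[0]);
            return 0;
        }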

  20. Message Passing vs. Shared Address Space on a Cluster of SMPs

    NASA Technical Reports Server (NTRS)

    Shan, Hongzhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak

    2000-01-01

    The convergence of scalable computer architectures using clusters of PCs (or PC-SMPs) with commodity networking has made them an attractive platform for high-end scientific computing. Currently, message-passing and shared address space (SAS) are the two leading programming paradigms for these systems. Message-passing has been standardized with MPI, and is the most common and mature programming approach. However, message-passing code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of, and programming effort required for, six applications under both programming models on a 32-CPU PC-SMP cluster. Our application suite consists of codes that typically do not exhibit high efficiency under shared-memory programming, due to their high communication-to-computation ratios and complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications; however, on certain classes of problems SAS performance is competitive with MPI. We also present new algorithms for improving the PC cluster performance of MPI collective operations.

  1. In Remembrance: September 11, 2001

    ERIC Educational Resources Information Center

    Haeseler, Martha P.

    2002-01-01

    In this article, the author shares her experience of being part of the creation of a memorial mosaic dedicated to those who had died on September 11, 2001. Working with veterans at a long-term outpatient program within a Veterans Administration (VA) Mental Hygiene Clinic, she found that the physical process of constructing something from…

  2. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA AMES Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.

  3. A mixed parallel strategy for the solution of coupled multi-scale problems at finite strains

    NASA Astrophysics Data System (ADS)

    Lopes, I. A. Rodrigues; Pires, F. M. Andrade; Reis, F. J. P.

    2018-02-01

    A mixed parallel strategy for the solution of homogenization-based multi-scale constitutive problems undergoing finite strains is proposed. The approach aims to reduce the computational time and memory requirements of non-linear coupled simulations that use finite element discretization at both scales (FE^2). In the first level of the algorithm, a non-conforming domain decomposition technique, based on the FETI method combined with a mortar discretization at the interface of macroscopic subdomains, is employed. A master-slave scheme, which distributes tasks by macroscopic element and adopts dynamic scheduling, is then used for each macroscopic subdomain composing the second level of the algorithm. This strategy allows the parallelization of FE^2 simulations in computers with either shared memory or distributed memory architectures. The proposed strategy preserves the quadratic rates of asymptotic convergence that characterize the Newton-Raphson scheme. Several examples are presented to demonstrate the robustness and efficiency of the proposed parallel strategy.

  4. High Band Technology Program (HiTeP)

    DTIC Science & Technology

    2005-03-01

    ...clock distribution circuit. One Receiver Memory module receives a 60 MHz reference sine wave and distributes 60 MHz clock signals to all Receiver Memory... ...the Electrically Short Crossed-Notch (ESCN). It is shorter than traditional traveling wave notch antennas. The 2X ESCN fin length is approximately 1.2...

  5. Implementing Molecular Dynamics for Hybrid High Performance Computers - 1. Short Range Forces

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Brown, W Michael; Wang, Peng; Plimpton, Steven J

    The use of accelerators such as general-purpose graphics processing units (GPGPUs) have become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines - 1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, 2) minimizing the amount of code that must be ported for efficient acceleration, 3) utilizing the available processing power from both many-core CPUs and accelerators, and 4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS. We describe algorithms for efficient short range force calculation on hybrid high performance machines. We describe a new approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPGPUs and 180 CPU cores.

  6. An Interactive Simulation Program for Exploring Computational Models of Auto-Associative Memory.

    PubMed

    Fink, Christian G

    2017-01-01

    While neuroscience students typically learn about activity-dependent plasticity early in their education, they often struggle to conceptually connect modification at the synaptic scale with network-level neuronal dynamics, not to mention with their own everyday experience of recalling a memory. We have developed an interactive simulation program (based on the Hopfield model of auto-associative memory) that enables the user to visualize the connections generated by any pattern of neural activity, as well as to simulate the network dynamics resulting from such connectivity. An accompanying set of student exercises introduces the concepts of pattern completion, pattern separation, and sparse versus distributed neural representations. Results from a conceptual assessment administered before and after students worked through these exercises indicate that the simulation program is a useful pedagogical tool for illustrating fundamental concepts of computational models of memory.
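
    The core of a Hopfield auto-associative memory is small enough to sketch in C: Hebbian training stores patterns in a symmetric weight matrix, and asynchronous updates perform pattern completion. This is a generic illustration of the model, not code from the interactive program described above.

        #include <stdio.h>

        #define N 16                            /* neurons */
        #define P 2                             /* stored patterns */

        static int w[N][N];                     /* symmetric weights, zero diagonal */

        /* Hebbian rule: strengthen connections between co-active neurons. */
        static void train(int pat[P][N]) {
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    if (i != j)
                        for (int p = 0; p < P; ++p)
                            w[i][j] += pat[p][i] * pat[p][j];
        }

        /* Asynchronous updates drive the state toward a stored attractor. */
        static void recall(int *s, int sweeps) {
            for (int t = 0; t < sweeps; ++t)
                for (int i = 0; i < N; ++i) {
                    int h = 0;
                    for (int j = 0; j < N; ++j) h += w[i][j] * s[j];
                    s[i] = (h >= 0) ? 1 : -1;
                }
        }

        int main(void) {
            int pat[P][N] = {
                { 1, 1, 1, 1,-1,-1,-1,-1, 1, 1, 1, 1,-1,-1,-1,-1},
                { 1,-1, 1,-1, 1,-1, 1,-1, 1,-1, 1,-1, 1,-1, 1,-1},
            };
            train(pat);

            int probe[N];
            for (int i = 0; i < N; ++i) probe[i] = pat[0][i];
            probe[0] = -1; probe[5] = 1;        /* corrupt two bits */
            recall(probe, 5);                   /* pattern completion */

            for (int i = 0; i < N; ++i) printf("%2d", probe[i]);
            printf("\n");                       /* recovers pattern 0 */
            return 0;
        }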

  7. Supercomputing '91; Proceedings of the 4th Annual Conference on High Performance Computing, Albuquerque, NM, Nov. 18-22, 1991

    NASA Technical Reports Server (NTRS)

    1991-01-01

    Various papers on supercomputing are presented. The general topics addressed include: program analysis/data dependence, memory access, distributed memory code generation, numerical algorithms, supercomputer benchmarks, latency tolerance, parallel programming, applications, processor design, networks, performance tools, mapping and scheduling, characterization affecting performance, parallelism packaging, computing climate change, combinatorial algorithms, hardware and software performance issues, system issues. (No individual items are abstracted in this volume)

  8. Memory systems in schizophrenia: Modularity is preserved but deficits are generalized.

    PubMed

    Haut, Kristen M; Karlsgodt, Katherine H; Bilder, Robert M; Congdon, Eliza; Freimer, Nelson B; London, Edythe D; Sabb, Fred W; Ventura, Joseph; Cannon, Tyrone D

    2015-10-01

    Schizophrenia patients exhibit impaired working and episodic memory, but this may represent generalized impairment across memory modalities or performance deficits restricted to particular memory systems in subgroups of patients. Furthermore, it is unclear whether deficits are unique from those associated with other disorders. Healthy controls (n=1101) and patients with schizophrenia (n=58), bipolar disorder (n=49) and attention-deficit-hyperactivity-disorder (n=46) performed 18 tasks addressing primarily verbal and spatial episodic and working memory. Effect sizes for group contrasts were compared across tasks and the consistency of subjects' distributional positions across memory domains was measured. Schizophrenia patients performed poorly relative to the other groups on every test. While low to moderate correlation was found between memory domains (r=.320), supporting modularity of these systems, there was limited agreement between measures regarding each individual's task performance (ICC=.292) and in identifying those individuals falling into the lowest quintile (kappa=0.259). A general ability factor accounted for nearly all of the group differences in performance and agreement across measures in classifying low performers. Pathophysiological processes involved in schizophrenia appear to act primarily on general abilities required in all tasks rather than on specific abilities within different memory domains and modalities. These effects represent a general shift in the overall distribution of general ability (i.e., each case functioning at a lower level than they would have if not for the illness), rather than presence of a generally low-performing subgroup of patients. There is little evidence that memory impairments in schizophrenia are shared with bipolar disorder and ADHD. Copyright © 2015 Elsevier B.V. All rights reserved.

  10. Epigenetic Networks Regulate the Transcriptional Program in Memory and Terminally Differentiated CD8+ T Cells.

    PubMed

    Rodriguez, Ramon M; Suarez-Alvarez, Beatriz; Lavín, José L; Mosén-Ansorena, David; Baragaño Raneros, Aroa; Márquez-Kisinousky, Leonardo; Aransay, Ana M; Lopez-Larrea, Carlos

    2017-01-15

    Epigenetic mechanisms play a critical role during differentiation of T cells by contributing to the formation of stable and heritable transcriptional patterns. To better understand the mechanisms of memory maintenance in CD8+ T cells, we performed genome-wide analysis of DNA methylation, histone marking (acetylated and trimethylated lysine 9 in histone H3), and gene-expression profiles in naive, effector memory (EM), and terminally differentiated EM (TEMRA) cells. Our results indicate that DNA demethylation and histone acetylation are coordinated to generate the transcriptional program associated with memory cells. Conversely, EM and TEMRA cells share a very similar epigenetic landscape. Nonetheless, the TEMRA transcriptional program predicts an innate immunity phenotype associated with genes never reported in these cells, including several mediators of NK cell activation (VAV3 and LYN) and a large array of NK receptors (e.g., KIR2DL3, KIR2DL4, KIR2DL1, KIR3DL1, KIR2DS5). In addition, we identified up to 161 genes that encode transcriptional regulators, some of unknown function in CD8+ T cells, that were differentially expressed in the course of differentiation. Overall, these results provide new insights into the regulatory networks involved in memory CD8+ T cell maintenance and T cell terminal differentiation. Copyright © 2017 by The American Association of Immunologists, Inc.

  11. A mechanism for efficient debugging of parallel programs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Miller, B.P.; Choi, J.D.

    1988-01-01

    This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). The authors describe the use of flowback analysis to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. The authors introduce a mechanism called incremental tracing that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. They extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions of the co-operating processes.

  12. Software-Controlled Caches in the VMP Multiprocessor

    DTIC Science & Technology

    1986-03-01

    (Abstract fragments only) The recoverable content: in the VMP multiprocessor, cache misses are handled in software, analogously to the handling of virtual memory page faults; the report explores how far software support can go, notes that each cache miss results in bus traffic, and tabulates the bus cost of an "average" cache miss.

  13. High Performance Programming Using Explicit Shared Memory Model on the Cray T3D

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)

    1994-01-01

    The Cray T3D is the first-phase system in Cray Research Inc.'s (CRI) three-phase massively parallel processing program. In this report we describe the architecture of the T3D, as well as the CRAFT (Cray Research Adaptive Fortran) programming model, and contrast it with PVM, which is also supported on the T3D. We present some performance data based on the NAS Parallel Benchmarks to illustrate both architectural and software features of the T3D.
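
    The explicit shared memory style described here lives on in one-sided put/get libraries. A hedged sketch in modern OpenSHMEM, a descendant of the Cray shmem interface (illustrative of the model, not CRAFT code):

        #include <stdio.h>
        #include <shmem.h>

        int main(void) {
            static long src[4], dst[4];   /* symmetric: same address on every PE */
            shmem_init();
            int me = shmem_my_pe();
            int npes = shmem_n_pes();
            for (int i = 0; i < 4; i++) src[i] = 100 * me + i;

            /* one-sided put: write directly into the neighbor's memory,
               with no matching receive on the remote side */
            shmem_long_put(dst, src, 4, (me + 1) % npes);
            shmem_barrier_all();

            printf("PE %d got %ld from PE %d\n",
                   me, dst[0], (me + npes - 1) % npes);
            shmem_finalize();
            return 0;
        }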

  14. Visual and Spatial Working Memory Are Not that Dissociated after All: A Time-Based Resource-Sharing Account

    ERIC Educational Resources Information Center

    Vergauwe, Evie; Barrouillet, Pierre; Camos, Valerie

    2009-01-01

    Examinations of interference between visual and spatial materials in working memory have suggested domain- and process-based fractionations of visuo-spatial working memory. The present study examined the role of central time-based resource sharing in visuo-spatial working memory and assessed its role in obtained interference patterns. Visual and…

  15. An implementation of the SNR high speed network communication protocol (Receiver part)

    NASA Astrophysics Data System (ADS)

    Wan, Wen-Jyh

    1995-03-01

    This thesis work implements the receiver part of the SNR high speed network transport protocol. The approach was to use the Systems of Communicating Machines (SCM) as the formal definition of the protocol. Programs were developed on top of the Unix system using the C programming language. The Unix system features adopted for this implementation were multitasking, signals, shared memory, semaphores, sockets, timers and process control. The problems encountered, and solved, were signal loss, shared memory conflicts, process synchronization, scheduling, data alignment and errors in the SCM specification itself. The result was a correctly functioning program which implemented the SNR protocol. The system was tested using different connection modes, lost packets, duplicate packets and large data transfers. The contributions of this thesis are: (1) implementation of the receiver part of the SNR high speed transport protocol; (2) testing and integration with the transmitter part of the SNR transport protocol on an FDDI data link layered network; (3) demonstration of the functions of the SNR transport protocol such as connection management, sequenced delivery, flow control and error recovery using selective repeat methods of retransmission; and (4) modifications to the SNR transport protocol specification, such as corrections for incorrect predicate conditions, definitions of additional packet type formats, and solutions for signal loss and process contention problems.
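
    A hedged sketch of the shared memory and semaphore combination listed above, using the System V calls available in Unix systems of that era (illustrative; not code from the thesis):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/ipc.h>
        #include <sys/sem.h>
        #include <sys/shm.h>
        #include <sys/wait.h>
        #include <unistd.h>

        union semun { int val; };   /* the caller must define this union */

        int main(void) {
            int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
            int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
            union semun zero = { 0 };
            semctl(semid, 0, SETVAL, zero);        /* start locked */

            if (fork() == 0) {                     /* child: producer */
                char *buf = shmat(shmid, NULL, 0);
                strcpy(buf, "packet 1");
                struct sembuf up = { 0, +1, 0 };
                semop(semid, &up, 1);              /* signal: data ready */
                shmdt(buf);
                exit(0);
            }

            struct sembuf down = { 0, -1, 0 };
            semop(semid, &down, 1);                /* block until signaled */
            char *buf = shmat(shmid, NULL, 0);
            printf("consumer read: %s\n", buf);
            shmdt(buf);
            wait(NULL);
            shmctl(shmid, IPC_RMID, NULL);
            semctl(semid, 0, IPC_RMID);
            return 0;
        }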

  16. Programming parallel architectures: The BLAZE family of languages

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush

    1988-01-01

    Programming multiprocessor architectures is a critical research issue. An overview is given of the various approaches to programming these architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive since they remove much of the burden of exploiting parallel architectures from the user. Also described is recent work by the author in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described, as well as the relations of this work to other current language research projects.

  17. Debugging Fortran on a shared memory machine

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Allen, T.R.; Padua, D.A.

    1987-01-01

    Debugging on a parallel processor is more difficult than debugging on a serial machine because errors in a parallel program may introduce nondeterminism. The approach to parallel debugging presented here attempts to reduce the problem of debugging on a parallel machine to that of debugging on a serial machine by automatically detecting nondeterminism. 20 refs., 6 figs.
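
    The nondeterminism at issue is typically an unsynchronized access to shared data. A hedged C illustration of the kind of race such a tool must detect (the original work targeted Fortran, but the hazard is identical):

        #include <pthread.h>
        #include <stdio.h>

        static long counter = 0;           /* shared and unprotected */

        static void *worker(void *arg) {
            (void)arg;
            for (int i = 0; i < 100000; i++)
                counter++;                 /* racy read-modify-write */
            return NULL;
        }

        int main(void) {
            pthread_t a, b;
            pthread_create(&a, NULL, worker, NULL);
            pthread_create(&b, NULL, worker, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            /* any value from 100000 to 200000 is possible, and the
               result can change from run to run: nondeterminism */
            printf("counter = %ld\n", counter);
            return 0;
        }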

  18. A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics

    NASA Technical Reports Server (NTRS)

    Bard, Christopher; Dorelli, John C.

    2013-01-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approximately 126 for a 1024 × 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.

  19. A pervasive parallel framework for visualization: final report for FWP 10-014707

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moreland, Kenneth D.

    2014-01-01

    We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much finer-grained thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documents the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.

  20. Rapid recovery from transient faults in the fault-tolerant processor with fault-tolerant shared memory

    NASA Technical Reports Server (NTRS)

    Harper, Richard E.; Butler, Bryan P.

    1990-01-01

    The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.

  1. Paradigms and strategies for scientific computing on distributed memory concurrent computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Foster, I.T.; Walker, D.W.

    1994-06-01

    In this work we examine recent advances in parallel languages and abstractions that have the potential for improving the programmability and maintainability of large-scale, parallel, scientific applications running on high performance architectures and networks. This paper focuses on Fortran M, a set of extensions to Fortran 77 that supports the modular design of message-passing programs. We describe the Fortran M implementation of a particle-in-cell (PIC) plasma simulation application, and discuss issues in the optimization of the code. The use of two other methodologies for parallelizing the PIC application is considered. The first is based on the shared object abstraction as embodied in the Orca language. The second approach is the Split-C language. In Fortran M, Orca, and Split-C the ability of the programmer to control the granularity of communication is important in designing an efficient implementation.

  2. Relations of maternal style and child self-concept to autobiographical memories in Chinese, Chinese immigrant, and European American 3-year-olds.

    PubMed

    Wang, Qi

    2006-01-01

    The relations of maternal reminiscing style and child self-concept to children's shared and independent autobiographical memories were examined in a sample of 189 three-year-olds and their mothers from Chinese families in China, first-generation Chinese immigrant families in the United States, and European American families. Mothers shared memories with their children and completed questionnaires; children recounted autobiographical events and described themselves with a researcher. Independent of culture, gender, child age, and language skills, maternal elaborations and evaluations were associated with children's shared memory reports, and maternal evaluations and child agentic self-focus were associated with children's independent memory reports. Maternal style and child self-concept further mediated cultural influences on children's memory. The findings provide insight into the social-cultural construction of autobiographical memory.

  3. Memory Allocation: Mechanisms and Function.

    PubMed

    Josselyn, Sheena A; Frankland, Paul W

    2018-04-25

    Memories for events are thought to be represented in sparse, distributed neuronal ensembles (or engrams). In this article, we review how neurons are chosen to become part of a particular engram, via a process of neuronal allocation. Experiments in rodents indicate that eligible neurons compete for allocation to a given engram, with more excitable neurons winning this competition. Moreover, fluctuations in neuronal excitability determine how engrams interact, promoting either memory integration (via coallocation to overlapping engrams) or separation (via disallocation to nonoverlapping engrams). In parallel with rodent studies, recent findings in humans verify the importance of this memory integration process for linking memories that occur close in time or share related content. A deeper understanding of allocation promises to provide insights into the logic underlying how knowledge is normally organized in the brain and the disorders in which this process has gone awry.

  4. SAHAYOG: A Testbed for Load Sharing under Failure,

    DTIC Science & Technology

    1987-07-01

    (Abstract fragments only) Processes in the SAHAYOG testbed communicate through Unix IPC: messages, shared memory, and semaphores. To communicate using messages, processes create message queues using system-provided primitives. The size of the memory that is to be shared is decided by the process when it makes a request for memory allocation. The semaphore option of IPC can be used to prevent two or more concurrent processes from executing their critical sections at the same time.
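
    A hedged sketch of the message-queue primitive described in these fragments, using the standard System V calls (illustrative; not testbed code):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/ipc.h>
        #include <sys/msg.h>
        #include <sys/wait.h>
        #include <unistd.h>

        struct job { long mtype; char text[64]; };

        int main(void) {
            int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

            if (fork() == 0) {                        /* child: sender */
                struct job j = { 1, "recompute block 7" };
                msgsnd(qid, &j, sizeof j.text, 0);
                exit(0);
            }

            struct job j;
            msgrcv(qid, &j, sizeof j.text, 1, 0);     /* blocks for a type-1 message */
            printf("received: %s\n", j.text);
            wait(NULL);
            msgctl(qid, IPC_RMID, NULL);
            return 0;
        }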

  5. Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Buntinas, D.; Mercier, G.; Gropp, W.

    2007-09-01

    This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.
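
    A hedged example of the kind of shared-memory microbenchmark used in such evaluations: a two-rank ping-pong that estimates one-way latency (illustrative; not the paper's benchmark code):

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int iters = 10000;
            char byte = 0;
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < iters; i++) {
                if (rank == 0) {          /* send, then wait for the echo */
                    MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {   /* echo every message back */
                    MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t1 = MPI_Wtime();
            if (rank == 0)    /* half the round trip is the one-way latency */
                printf("one-way latency: %.3f us\n",
                       (t1 - t0) * 1e6 / (2.0 * iters));
            MPI_Finalize();
            return 0;
        }

    Run with two ranks on one node (e.g., mpiexec -n 2) so that the messages travel through the shared-memory path being measured.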

  6. Hybrid MPI/OpenMP Implementation of the ORAC Molecular Dynamics Program for Generalized Ensemble and Fast Switching Alchemical Simulations.

    PubMed

    Procacci, Piero

    2016-06-27

    We present a new release (6.0β) of the ORAC program [Marsili et al., J. Comput. Chem. 2010, 31, 1106-1116] with a hybrid OpenMP/MPI (open multiprocessing/message passing interface) multilevel parallelism tailored for generalized ensemble (GE) and fast switching double annihilation (FS-DAM) nonequilibrium technology, aimed at evaluating the binding free energy in drug-receptor systems on high performance computing platforms. The production of the GE or FS-DAM trajectories is handled using a weak scaling parallel approach on the MPI level only, while a strong scaling force decomposition scheme is implemented for intranode computations with shared memory access at the OpenMP level. The efficiency, simplicity, and inherent parallel nature of the ORAC implementation of the FS-DAM algorithm position the code as an effective tool for second-generation high throughput virtual screening in drug discovery and design. The code, along with documentation, testing, and ancillary tools, is distributed under the provisions of the General Public License and can be freely downloaded at www.chim.unifi.it/orac .
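
    A hedged skeleton of the two-level scheme described above: MPI ranks carry independent trajectories (the weak-scaling level), while OpenMP threads share the force work inside a node (the strong-scaling level). All names are illustrative, not ORAC's:

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int provided, rank;
            /* FUNNELED: only the main thread of each rank calls MPI */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* intranode force decomposition: threads share the arrays */
            double local_energy = 0.0;
            #pragma omp parallel for reduction(+ : local_energy)
            for (int i = 0; i < 1000000; i++)
                local_energy += 1e-6;     /* stand-in for a pair-force term */

            /* one independent trajectory per rank; combine the results */
            double total;
            MPI_Reduce(&local_energy, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if (rank == 0) printf("combined estimate: %f\n", total);
            MPI_Finalize();
            return 0;
        }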

  7. 3-D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.

    1997-12-31

    The aim of the work performed is to develop a 3D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS), satisfying the condition that numerical results be independent of the number of processors involved. Two basically different approaches to the structure of massive parallel computations have been developed. The first approach uses a 3D data matrix decomposition reconstructed at each temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shared memory. The second approach is based on using a 3D data matrix decomposition not reconstructed during a temporal cycle. The program was developed on the 8-processor CS MP-3 made in VNIIEF and was adapted to the massively parallel CS Meiko-2 at LLNL by joint efforts of the VNIIEF and LLNL staffs. A large number of numerical experiments has been carried out with different numbers of processors, up to 256, and the efficiency of parallelization has been evaluated as a function of processor number and parameters.
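
    The second, non-reconstructed approach fixes each processor's portion of the mesh once. A hedged sketch of the arithmetic behind such a static block decomposition (illustrative only; the sizes are invented):

        #include <stdio.h>

        /* split n cells over p ranks as evenly as possible;
           rank r owns the half-open range [lo, hi) */
        static void block_range(int n, int p, int r, int *lo, int *hi) {
            int base = n / p, rem = n % p;
            *lo = r * base + (r < rem ? r : rem);
            *hi = *lo + base + (r < rem ? 1 : 0);
        }

        int main(void) {
            int nx = 96;                  /* cells along x in the global mesh */
            int px = 4;                   /* ranks along x in the process grid */
            for (int r = 0; r < px; r++) {
                int lo, hi;
                block_range(nx, px, r, &lo, &hi);
                printf("x-slab %d owns cells [%d, %d)\n", r, lo, hi);
            }
            /* the same call partitions the y and z extents, giving every
               rank a fixed 3D block independent of the temporal cycle */
            return 0;
        }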

  8. Numerical methods on some structured matrix algebra problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jessup, E.R.

    1996-06-01

    This proposal concerned the design, analysis, and implementation of serial and parallel algorithms for certain structured matrix algebra problems. It emphasized large order problems and so focused on methods that can be implemented efficiently on distributed-memory MIMD multiprocessors. Such machines supply the computing power and extensive memory demanded by the large order problems. We proposed to examine three classes of matrix algebra problems: the symmetric and nonsymmetric eigenvalue problems (especially the tridiagonal cases) and the solution of linear systems with specially structured coefficient matrices. As all of these are of practical interest, a major goal of this work was to translate our research in linear algebra into useful tools for use by the computational scientists interested in these and related applications. Thus, in addition to software specific to the linear algebra problems, we proposed to produce a programming paradigm and library to aid in the design and implementation of programs for distributed-memory MIMD computers. We now report on our progress on each of the problems and on the programming tools.

  9. Loci-STREAM Version 0.9

    NASA Technical Reports Server (NTRS)

    Wright, Jeffrey; Thakur, Siddharth

    2006-01-01

    Loci-STREAM is an evolving computational fluid dynamics (CFD) software tool for simulating possibly chemically reacting, possibly unsteady flows in diverse settings, including rocket engines, turbomachines, oil refineries, etc. Loci-STREAM implements a pressure-based flow-solving algorithm that utilizes unstructured grids. (The benefit of low memory usage by pressure-based algorithms is well recognized by experts in the field.) The algorithm is robust for flows at all speeds from zero to hypersonic. The flexibility of arbitrary polyhedral grids enables accurate, efficient simulation of flows in complex geometries, including those of plume-impingement problems. The present version - Loci-STREAM version 0.9 - includes an interface with the Portable, Extensible Toolkit for Scientific Computation (PETSc) library for access to enhanced linear-equation-solving programs therein that accelerate convergence toward a solution. The name "Loci" reflects the creation of this software within the Loci computational framework, which was developed at Mississippi State University for the primary purpose of simplifying the writing of complex multidisciplinary application programs to run in distributed-memory computing environments including clusters of personal computers. Loci has been designed to relieve application programmers of the details of programming for distributed-memory computers.

  10. Corporate Memory and Spacecraft Course Development

    NASA Technical Reports Server (NTRS)

    1999-01-01

    The Florida Space Grant Consortium (FSGC) entered into a cooperative agreement with NASA's Kennedy Space Center (KSC) to manage two programs, (1) Corporate Memory and (2) Spacecraft Course Development. The FSGC, with its central office at the University of Florida, administered these programs. In conjunction with KSC, the FSGC was responsible for the release of the Request for Proposals (RFP's) and its distribution to all its affiliates. FSGC received the applications, forwarded them to KSC for evaluations, and awarded grants to the institutions chosen by KSC.

  11. Working memory resources are shared across sensory modalities.

    PubMed

    Salmela, V R; Moisala, M; Alho, K

    2014-10-01

    A common assumption in the working memory literature is that the visual and auditory modalities have separate and independent memory stores. Recent evidence on visual working memory has suggested that resources are shared between representations, and that the precision of representations sets the limit for memory performance. We tested whether memory resources are also shared across sensory modalities. Memory precision for two visual (spatial frequency and orientation) and two auditory (pitch and tone duration) features was measured separately for each feature and for all possible feature combinations. Thus, only the memory load was varied, from one to four features, while keeping the stimuli similar. In Experiment 1, two gratings and two tones-both containing two varying features-were presented simultaneously. In Experiment 2, two gratings and two tones-each containing only one varying feature-were presented sequentially. The memory precision (delayed discrimination threshold) for a single feature was close to the perceptual threshold. However, as the number of features to be remembered was increased, the discrimination thresholds increased more than twofold. Importantly, the decrease in memory precision did not depend on the modality of the other feature(s), or on whether the features were in the same or in separate objects. Hence, simultaneously storing one visual and one auditory feature had an effect on memory precision equal to those of simultaneously storing two visual or two auditory features. The results show that working memory is limited by the precision of the stored representations, and that working memory can be described as a resource pool that is shared across modalities.

  12. Parallel processing on the Livermore VAX 11/780-4 parallel processor system with compatibility to Cray Research, Inc. (CRI) multitasking. Version 1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Werner, N.E.; Van Matre, S.W.

    1985-05-01

    This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.

  13. "Everybody Had a Piece ...": Collaborative Practice and Shared Decision Making at the Open Book

    ERIC Educational Resources Information Center

    Gordon, John; Ramdeholl, Dianne

    2010-01-01

    The Open Book, an adult literacy program in Brooklyn, from 1985-2002, remains, for many of the students and staff involved, a defining experience in their lives, a time that allowed them to see different possibilities, for themselves and society. In an attempt to preserve the field's collective historical memory, the authors in this chapter…

  14. Research into display sharing techniques for distributed computing environments

    NASA Technical Reports Server (NTRS)

    Hugg, Steven B.; Fitzgerald, Paul F., Jr.; Rosson, Nina Y.; Johns, Stephen R.

    1990-01-01

    The X-based Display Sharing solution for distributed computing environments is described. The Display Sharing prototype includes the base functionality for telecast and display copy requirements. Since the prototype implementation is modular and the system design provided flexibility for Mission Control Center Upgrade (MCCU) operational considerations, the prototype implementation can be the baseline for a production Display Sharing implementation. To facilitate the process the following discussions are presented: theory of operation; system architecture; using the prototype; software description; research tools; prototype evaluation; and outstanding issues. The prototype is based on the concept of a dedicated central host performing the majority of the Display Sharing processing, allowing minimal impact on each individual workstation. Each workstation participating in Display Sharing hosts programs that facilitate the user's access to the Display Sharing host machine.

  15. Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

    NASA Astrophysics Data System (ADS)

    Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

    2014-06-01

    This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 10^6 spatial cells and 1 × 10^12 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40.74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-node nuclear simulation tool.
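
    DOMINO pairs Intel TBB threads with Eigen vector operations; a hedged C analogue of the same two nested levels (threads across cells, SIMD lanes across angular directions) using OpenMP, with invented sizes:

        #include <stdio.h>

        #define NCELLS 4096
        #define NDIRS  288                /* angular directions, as in S16 */

        static double flux[NCELLS][NDIRS], src[NCELLS][NDIRS];

        int main(void) {                  /* compile with -fopenmp */
            /* outer level, multicore: one chunk of cells per thread */
            #pragma omp parallel for
            for (int c = 0; c < NCELLS; c++) {
                /* inner level, SIMD: vectorize across directions */
                #pragma omp simd
                for (int d = 0; d < NDIRS; d++)
                    flux[c][d] = 0.5 * flux[c][d] + src[c][d];
            }
            printf("updated %d cells x %d directions\n", NCELLS, NDIRS);
            return 0;
        }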

  16. Dynamic data distributions in Vienna Fortran

    NASA Technical Reports Server (NTRS)

    Chapman, Barbara; Mehrotra, Piyush; Moritsch, Hans; Zima, Hans

    1993-01-01

    Vienna Fortran is a machine-independent language extension of Fortran, which is based upon the Single-Program-Multiple-Data (SPMD) paradigm and allows the user to write programs for distributed-memory systems using global addresses. The language features focus mainly on the issue of distributing data across virtual processor structures. This paper discusses those features of Vienna Fortran that allow the data distributions of arrays to change dynamically, depending on runtime conditions. The relevant language features are presented, their implementation is outlined, and their use in applications is described.

  17. Performance analysis and kernel size study of the Lynx real-time operating system

    NASA Technical Reports Server (NTRS)

    Liu, Yuan-Kwei; Gibson, James S.; Fernquist, Alan R.

    1993-01-01

    This paper analyzes the Lynx real-time operating system (LynxOS), which has been selected as the operating system for the Space Station Freedom Data Management System (DMS). The features of LynxOS are compared to those of other Unix-based operating systems (OS). The tools for measuring the performance of LynxOS, which include a high-speed digital timer/counter board, a device driver program, and an application program, are analyzed. The timings for interrupt response, process creation and deletion, threads, semaphores, shared memory, and signals are measured. The memory size of the DMS Embedded Data Processor (EDP) is limited. Moreover, virtual memory is not suitable for real-time applications because page-swap timing may not be deterministic. Therefore, the DMS software, including LynxOS, has to fit in the main memory of an EDP. To reduce the LynxOS kernel size, the following steps are taken: analyzing the factors that influence the kernel size; identifying the modules of LynxOS that may not be needed in an EDP; adjusting the system parameters of LynxOS; reconfiguring the device drivers used in the LynxOS; and analyzing the symbol table. The reductions in kernel disk size, kernel memory size, and total kernel size from each step mentioned above are listed and analyzed.

  18. Stochastic switching of TiO2-based memristive devices with identical initial memory states

    PubMed Central

    2014-01-01

    In this work, we show that identical TiO2-based memristive devices that possess the same initial resistive states are only phenomenologically similar as their internal structures may vary significantly, which could render quite dissimilar switching dynamics. We experimentally demonstrated that the resistive switching of practical devices with similar initial states could occur at different programming stimuli cycles. We argue that similar memory states can be transcribed via numerous distinct active core states through the dissimilar reduced TiO2-x filamentary distributions. Our hypothesis was finally verified via simulated results of the memory state evolution, by taking into account dissimilar initial filamentary distribution. PMID:24994953

  19. Working Memory Span Development: A Time-Based Resource-Sharing Model Account

    ERIC Educational Resources Information Center

    Barrouillet, Pierre; Gavens, Nathalie; Vergauwe, Evie; Gaillard, Vinciane; Camos, Valerie

    2009-01-01

    The time-based resource-sharing model (P. Barrouillet, S. Bernardin, & V. Camos, 2004) assumes that during complex working memory span tasks, attention is frequently and surreptitiously switched from processing to reactivate decaying memory traces before their complete loss. Three experiments involving children from 5 to 14 years of age…

  20. Direct access inter-process shared memory

    DOEpatents

    Brightwell, Ronald B; Pedretti, Kevin; Hudson, Trammell B

    2013-10-22

    A technique for directly sharing physical memory between processes executing on processor cores is described. The technique includes loading a plurality of processes into the physical memory for execution on a corresponding plurality of processor cores sharing the physical memory. An address space is mapped to each of the processes by populating a first entry in a top level virtual address table for each of the processes. The address space of each of the processes is cross-mapped into each of the processes by populating one or more subsequent entries of the top level virtual address table with the first entry in the top level virtual address table from other processes.
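
    The claimed cross-mapping works at the page-table level; for contrast, a hedged sketch of the conventional POSIX route by which two processes share one physical region (shm_open plus mmap; this illustrates ordinary practice, not the patented technique):

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void) {
            /* create a named shared object and size it to one page */
            int fd = shm_open("/demo_region", O_CREAT | O_RDWR, 0600);
            ftruncate(fd, 4096);
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

            if (fork() == 0) {             /* child writes via its own mapping */
                strcpy(p, "hello from the other address space");
                _exit(0);
            }
            wait(NULL);
            printf("parent reads: %s\n", p);   /* same physical page, no copy */

            munmap(p, 4096);
            shm_unlink("/demo_region");
            return 0;
        }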

  1. Grouping and binding in visual short-term memory.

    PubMed

    Quinlan, Philip T; Cohen, Dale J

    2012-09-01

    Findings of 2 experiments are reported that challenge the current understanding of visual short-term memory (VSTM). In both experiments, a single study display, containing 6 colored shapes, was presented briefly and then probed with a single colored shape. At stake is how VSTM retains a record of different objects that share common features: In the 1st experiment, 2 study items sometimes shared a common feature (either a shape or a color). The data revealed a color sharing effect, in which memory was much better for items that shared a common color than for items that did not. The 2nd experiment showed that the size of the color sharing effect depended on whether a single pair of items shared a common color or whether 2 pairs of items were so defined-memory for all items improved when 2 color groups were presented. In explaining performance, an account is advanced in which items compete for a fixed number of slots, but then memory recall for any given stored item is prone to error. A critical assumption is that items that share a common color are stored together in a slot as a chunk. The evidence provides further support for the idea that principles of perceptual organization may determine the manner in which items are stored in VSTM. PsycINFO Database Record (c) 2012 APA, all rights reserved.

  2. Long-distance quantum key distribution with imperfect devices

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lo Piparo, Nicoló; Razavi, Mohsen

    2014-12-04

    Quantum key distribution over probabilistic quantum repeaters is addressed. We compare, under practical assumptions, two such schemes in terms of their secure key generation rate per memory, R_QKD. The two schemes under investigation are the one proposed by Duan et al. in [Nat. 414, 413 (2001)] and that of Sangouard et al. proposed in [Phys. Rev. A 76, 050301 (2007)]. We consider various sources of imperfections in the latter protocol, such as a nonzero double-photon probability for the source, dark count per pulse, channel loss and inefficiencies in photodetectors and memories, to find the rate for different nesting levels. We determine the maximum value of the double-photon probability beyond which it is not possible to share a secret key anymore. We find the crossover distance for up to three nesting levels. We finally compare the two protocols.

  3. Hardware/software codesign for embedded RISC core

    NASA Astrophysics Data System (ADS)

    Liu, Peng

    2001-12-01

    This paper describes the hardware/software codesign method of the extendible embedded RISC core VIRGO, which is based on the MIPS-I instruction set architecture. VIRGO is described in the Verilog hardware description language, has a five-stage pipeline with a shared 32-bit cache/memory interface, and is controlled by a distributed control scheme. Every pipeline stage has one small controller, which controls that stage's status and its cooperation with the other pipeline stages. Since the description uses a high-level language and the control structure is distributed, the VIRGO core is highly extensible and can meet application requirements. Taking the high-definition television MPEG2 MPHL decoder chip as a case study, we constructed a hardware/software codesign virtual prototyping machine that supports research on the VIRGO core instruction set architecture, system-on-chip memory size requirements, and system-on-chip software. The virtual prototyping machine platform also allows evaluation of the system-on-chip design and the RISC instruction set.

  4. Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers

    NASA Technical Reports Server (NTRS)

    Su, Ernesto; Lain, Antonio; Ramaswamy, Shankar; Palermo, Daniel J.; Hodges, Eugene W., IV; Banerjee, Prithviraj

    1995-01-01

    The PARADIGM compiler project provides an automated means to parallelize programs, written in a serial programming model, for efficient execution on distributed-memory multicomputers. A previous implementation of the compiler based on the PTD representation allowed symbolic array sizes, affine loop bounds and array subscripts, and a variable number of processors, provided that arrays were single- or multi-dimensionally block distributed. The techniques presented here extend the compiler to also accept multidimensional cyclic and block-cyclic distributions within a uniform symbolic framework. These extensions demand more sophisticated symbolic manipulation capabilities. A novel aspect of our approach is to meet this demand by interfacing PARADIGM with a powerful off-the-shelf symbolic package, Mathematica. This paper describes some of the Mathematica routines that perform various transformations, shows how they are invoked and used by the compiler to overcome the new challenges, and presents experimental results for code involving cyclic and block-cyclic arrays as evidence of the feasibility of the approach.

  5. Revision of FMM-Yukawa: An adaptive fast multipole method for screened Coulomb interactions

    NASA Astrophysics Data System (ADS)

    Zhang, Bo; Huang, Jingfang; Pitsianis, Nikos P.; Sun, Xiaobai

    2010-12-01

    FMM-YUKAWA is a mathematical software package primarily for rapid evaluation of the screened Coulomb interactions of N particles in three dimensional space. Since its release, we have revised and re-organized the data structure, software architecture, and user interface, for the purpose of enabling more flexible, broader and easier use of the package. The package and its documentation are available at http://www.fastmultipole.org/, along with a few other closely related mathematical software packages.
    New version program summary
    Program title: FMM-Yukawa
    Catalogue identifier: AEEQ_v2_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEQ_v2_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: GNU GPL 2.0
    No. of lines in distributed program, including test data, etc.: 78 704
    No. of bytes in distributed program, including test data, etc.: 854 265
    Distribution format: tar.gz
    Programming language: FORTRAN 77, FORTRAN 90, and C. Requires gcc and gfortran version 4.4.3 or later
    Computer: All
    Operating system: Any
    Classification: 4.8, 4.12
    Catalogue identifier of previous version: AEEQ_v1_0
    Journal reference of previous version: Comput. Phys. Comm. 180 (2009) 2331
    Does the new version supersede the previous version?: Yes
    Nature of problem: To evaluate the screened Coulomb potential and force field of N charged particles, and to evaluate a convolution type integral where the Green's function is the fundamental solution of the modified Helmholtz equation.
    Solution method: The new version of the fast multipole method (FMM) that diagonalizes the multipole-to-local translation operator is applied, with the tree structure adaptive to sample particle locations.
    Reasons for new version: To handle much larger particle ensembles, to enable the iterative use of the subroutines in a solver, and to remove potential contention in assignments for parallelization.
    Summary of revisions: The software package FMM-Yukawa has been revised and re-organized in data structure, software architecture, programming methods, and user interface. The revision enables more flexible use of the package and economic use of memory resources. It consists of five stages. The initial stage (stage 1) determines, based on the accuracy requirement and FMM theory, the length of multipole expansions and the number of quadrature points for diagonalization, and loads the quadrature nodes and weights that are computed off line. Stage 2 constructs the oct-tree and interaction lists, with adaptation to the sparsity or density of particles and employing a dynamic memory allocation scheme at every tree level. Stage 3 executes the core FMM subroutine for numerical calculation of the particle interactions. The subroutine can now be used iteratively as in a solver, while the particle locations remain the same. Stage 4 releases the memory allocated in Stage 2 for the adaptive tree and interaction lists. The user can modify the iterative routine easily. When the particle locations are changed, such as in a molecular dynamics simulation, stages 2 to 4 can also be used together repeatedly. The final stage releases the memory space used for the quadrature and other remaining FMM parameters. Programs at the stage level and at the user interface are re-written in the C programming language, while most of the translation and interaction operations remain in FORTRAN.
    As a result of the change in data structures and memory allocation, the revised package can accommodate much larger particle ensembles while maintaining the same accuracy-efficiency performance. The new version is also developed as an important precursor to its parallel counterpart on multi-core or many-core processors in a shared memory programming environment. In particular, in order to ensure mutual exclusion in concurrent updates without incurring extra latency, we have replaced all the assignment statements at a source box that put its data to multiple target boxes with assignments at every target box that gather data from source boxes. This amounts to replacing the column version of matrix-vector multiplication with the row version. The matrix here, however, is in compressive representation. Sufficient care is taken in the revision not to alter the algorithmic complexity or numerical behavior, as concurrent writing potentially takes place in the upward calculation of the multipole expansion coefficients, in interactions at every level of the FMM tree, and in the downward calculation of the local expansion coefficients. The software modules and their compositions are also organized according to the stages in which they are used. Demonstration files and makefiles for merging the user routines and the library routines are provided.
    Restrictions: Accuracy requirement is described in terms of three or six digits. Higher multiples of three digits will be allowed in a later version. Finer decimation in digits for accuracy specification may or may not be necessary.
    Unusual features: Ready and friendly for customized use, and instrumental in expressing concurrency and dependency for efficient parallelization.
    Running time: The running time depends linearly on the number N of particles and varies with the characteristics of the particle distribution. It also depends on the accuracy requirement; a higher accuracy requirement takes longer. The code outperforms the direct summation method when N⩾750.
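
    The scatter-to-gather rewrite described under "Summary of revisions" can be pictured with a dense matrix-vector product; a hedged sketch (the FMM operators themselves are applied in compressed form):

        #include <stdio.h>

        #define N 4

        /* column version: each source j scatters into many y[i]; two
           sources updating the same y[i] concurrently would conflict */
        static void matvec_scatter(const double A[N][N],
                                   const double x[N], double y[N]) {
            for (int i = 0; i < N; i++) y[i] = 0.0;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    y[i] += A[i][j] * x[j];
        }

        /* row version: each target i gathers from all sources; targets
           touch disjoint y[i], so they can run in parallel lock-free */
        static void matvec_gather(const double A[N][N],
                                  const double x[N], double y[N]) {
            for (int i = 0; i < N; i++) {
                double s = 0.0;
                for (int j = 0; j < N; j++) s += A[i][j] * x[j];
                y[i] = s;
            }
        }

        int main(void) {
            double A[N][N] = { { 1, 2, 3, 4 }, { 5, 6, 7, 8 },
                               { 9, 10, 11, 12 }, { 13, 14, 15, 16 } };
            double x[N] = { 1, 1, 1, 1 }, y[N];
            matvec_scatter(A, x, y);   /* both give the same answer ... */
            matvec_gather(A, x, y);    /* ... but only gather is race-free */
            for (int i = 0; i < N; i++) printf("%g ", y[i]);
            printf("\n");
            return 0;
        }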

  6. Sptrace

    NASA Technical Reports Server (NTRS)

    Burleigh, Scott C.

    2011-01-01

    Sptrace is a general-purpose space utilization tracing system that is conceptually similar to the commercial Purify product used to detect leaks and other memory usage errors. It is designed to monitor space utilization in any sort of heap, i.e., a region of data storage on some device (nominally memory; possibly shared and possibly persistent) with a flat address space. This software can trace usage of shared and/or non-volatile storage in addition to private RAM (random access memory). Sptrace is implemented as a set of C function calls that are invoked from within the software that is being examined. The function calls fall into two broad classes: (1) functions that are embedded within the heap management software [e.g., JPL's SDR (Simple Data Recorder) and PSM (Personal Space Management) systems] to enable heap usage analysis by populating a virtual time-sequenced log of usage activity, and (2) reporting functions that are embedded within the application program whose behavior is suspect. For ease of use, these functions may be wrapped privately inside public functions offered by the heap management software. Sptrace can be used for VxWorks or RTEMS realtime systems as easily as for Linux or OS/X systems.
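
    A hedged sketch of the embedding pattern described: the heap manager calls trace hooks at every allocation and release, and the suspect application calls a reporting function. All names here are invented for illustration, not the Sptrace API:

        #include <stdio.h>
        #include <stdlib.h>

        /* class 1: hooks embedded in the heap management software */
        static size_t live_bytes = 0, live_blocks = 0;

        static void *traced_alloc(size_t n, const char *site) {
            void *p = malloc(n);
            if (p) { live_bytes += n; live_blocks++; }
            printf("trace: +%zu bytes at %s\n", n, site);
            return p;
        }

        static void traced_free(void *p, size_t n, const char *site) {
            free(p);
            live_bytes -= n; live_blocks--;
            printf("trace: -%zu bytes at %s\n", n, site);
        }

        /* class 2: reporting function called from the application */
        static void trace_report(void) {
            printf("report: %zu blocks, %zu bytes still allocated\n",
                   live_blocks, live_bytes);
        }

        int main(void) {
            void *a = traced_alloc(128, "parser");
            void *b = traced_alloc(64, "queue");
            traced_free(a, 128, "parser");
            trace_report();      /* the 64 unreleased bytes surface as a leak */
            traced_free(b, 64, "queue");
            return 0;
        }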

  7. Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.

    PubMed

    Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano

    2014-09-09

    A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system.

  8. Shared memories reveal shared structure in neural activity across individuals

    PubMed Central

    Chen, J.; Leong, Y.C.; Honey, C.J.; Yong, C.H.; Norman, K.A.; Hasson, U.

    2016-01-01

    Our lives revolve around sharing experiences and memories with others. When different people recount the same events, how similar are their underlying neural representations? Participants viewed a fifty-minute movie, then verbally described the events during functional MRI, producing unguided detailed descriptions lasting up to forty minutes. As each person spoke, event-specific spatial patterns were reinstated in default-network, medial-temporal, and high-level visual areas. Individual event patterns were both highly discriminable from one another and similar between people, suggesting consistent spatial organization. In many high-order areas, patterns were more similar between people recalling the same event than between recall and perception, indicating systematic reshaping of percept into memory. These results reveal the existence of a common spatial organization for memories in high-level cortical areas, where encoded information is largely abstracted beyond sensory constraints; and that neural patterns during perception are altered systematically across people into shared memory representations for real-life events. PMID:27918531

  9. Parallel language constructs for tensor product computations on loosely coupled architectures

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Vanrosendale, John

    1989-01-01

    Distributed memory architectures offer high levels of performance and flexibility, but have proven awkward to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. The focus is on tensor product array computations, a simple but important class of numerical algorithms. The problem of programming 1-D kernel routines, such as parallel tridiagonal solvers, is considered first; it is then examined how such parallel kernels can be combined to form parallel tensor product algorithms.

  10. An accurate, compact and computationally efficient representation of orbitals for quantum Monte Carlo calculations

    NASA Astrophysics Data System (ADS)

    Luo, Ye; Esler, Kenneth; Kent, Paul; Shulenburger, Luke

    Quantum Monte Carlo (QMC) calculations of giant molecules and of surface and defect properties of solids have recently become feasible due to drastically expanding computational resources. However, with the most computationally efficient basis set, B-splines, these calculations are severely restricted by the memory capacity of compute nodes. The B-spline coefficients are shared on a node but not distributed among nodes, to ensure fast evaluation. A hybrid representation, which incorporates atomic orbitals near the ions and B-spline ones in the interstitial regions, offers a more accurate and less memory-demanding description of the orbitals, because they are naturally more atomic-like near the ions and much smoother in between, thus allowing coarser B-spline grids. We will demonstrate the advantage of the hybrid representation over pure B-spline and Gaussian basis sets, and also show significant speed-ups with the new scheme, for example in computing the non-local pseudopotentials. Moreover, we discuss a new algorithm for atomic orbital initialization, which previously required an extra workflow step taking a few days. With this work, the highly efficient hybrid representation paves the way to simulating large and even inhomogeneous systems using QMC. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Computational Materials Sciences Program.

  11. Overview of the LINCS architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fletcher, J.G.; Watson, R.W.

    1982-01-13

    Computing at the Lawrence Livermore National Laboratory (LLNL) has evolved over the past 15 years with a computer network based resource sharing environment. The increasing use of low cost and high performance micro, mini and midi computers and commercially available local networking systems will accelerate this trend. Further, even the large scale computer systems, on which much of the LLNL scientific computing depends, are evolving into multiprocessor systems. It is our belief that the most cost effective use of this environment will depend on the development of application systems structured into cooperating concurrent program modules (processes) distributed appropriately over different nodes of the environment. A node is defined as one or more processors with a local (shared) high speed memory. Given the latter view, the environment can be characterized as consisting of: multiple nodes communicating over noisy channels with arbitrary delays and throughput, heterogeneous base resources and information encodings, no single administration controlling all resources, distributed system state, and no uniform time base. The system design problem is: how to turn the heterogeneous base hardware/firmware/software resources of this environment into a coherent set of resources that facilitate development of cost effective, reliable, and human engineered applications. We believe the answer lies in developing a layered, communication oriented distributed system architecture; layered and modular to support ease of understanding, reconfiguration, extensibility, and hiding of implementation or nonessential local details; communication oriented because that is a central feature of the environment. The Livermore Interactive Network Communication System (LINCS) is a hierarchical architecture designed to meet the above needs. While having characteristics in common with other architectures, it differs in several respects.

  12. I remember you: a role for memory in social cognition and the functional neuroanatomy of their interaction.

    PubMed

    Spreng, R Nathan; Mar, Raymond A

    2012-01-05

    Remembering events from the personal past (autobiographical memory) and inferring the thoughts and feelings of other people (mentalizing) share a neural substrate. The shared functional neuroanatomy of these processes has been demonstrated in a meta-analysis of independent task domains (Spreng, Mar & Kim, 2009) and within subjects performing both tasks (Rabin, Gilboa, Stuss, Mar, & Rosenbaum, 2010; Spreng & Grady, 2010). Here, we examine spontaneous low-frequency fluctuations in fMRI BOLD signal during rest from two separate regions key to memory and mentalizing, the left hippocampus and right temporal parietal junction, respectively. Activity in these two regions was then correlated with the entire brain in a resting-state functional connectivity analysis. Although the left hippocampus and right temporal parietal junction were not correlated with each other, both were correlated with a distributed network of brain regions. These regions were consistent with the previously observed overlap between autobiographical memory and mentalizing evoked brain activity found in past studies. Reliable patterns of overlap included the superior temporal sulcus, anterior temporal lobe, lateral inferior parietal cortex (angular gyrus), posterior cingulate cortex, dorsomedial and ventral prefrontal cortex, inferior frontal gyrus, and the amygdala. We propose that the functional overlap facilitates the integration of personal and interpersonal information and provides a means for personal experiences to become social conceptual knowledge. This knowledge, in turn, informs strategic social behavior in support of personal goals. In closing, we argue for a new perspective within social cognitive neuroscience, emphasizing the importance of memory in social cognition. Copyright © 2010 Elsevier B.V. All rights reserved.

  14. Effects of Network Structure, Competition and Memory Time on Social Spreading Phenomena

    NASA Astrophysics Data System (ADS)

    Gleeson, James P.; O'Sullivan, Kevin P.; Baños, Raquel A.; Moreno, Yamir

    2016-04-01

    Online social media has greatly affected the way in which we communicate with each other. However, little is known about what fundamental mechanisms drive dynamical information flow in online social systems. Here, we introduce a generative model for online sharing behavior that is analytically tractable and that can reproduce several characteristics of empirical micro-blogging data on hashtag usage, such as (time-dependent) heavy-tailed distributions of meme popularity. The presented framework constitutes a null model for social spreading phenomena that, in contrast to purely empirical studies or simulation-based models, clearly distinguishes the roles of two distinct factors affecting meme popularity: the memory time of users and the connectivity structure of the social network.

  15. GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Simmhan, Yogesh; Kumbhare, Alok; Wickramaarachchi, Charith

    2014-08-25

    Large scale graph processing is a major research area for Big Data exploration. Vertex centric programming models like Pregel are gaining traction due to their simple abstraction that naturally allows for scalable execution on distributed systems. However, there are limitations to this approach which cause vertex centric algorithms to under-perform due to a poor compute-to-communication ratio and slow convergence of iterative supersteps. In this paper we introduce GoFFish, a scalable sub-graph centric framework co-designed with a distributed persistent graph storage for large scale graph analytics on commodity clusters. We introduce a sub-graph centric programming abstraction that combines the scalability of a vertex centric approach with the flexibility of shared memory sub-graph computation. We map Connected Components, SSSP and PageRank algorithms to this model to illustrate its flexibility. Further, we empirically analyze GoFFish using several real world graphs and demonstrate its significant performance improvement, orders of magnitude in some cases, compared to Apache Giraph, the leading open source vertex centric implementation.

  16. Development of a Dynamic Time Sharing Scheduled Environment Final Report CRADA No. TC-824-94E

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jette, M.; Caliga, D.

    Massively parallel computers, such as the Cray T3D, have historically supported resource sharing solely with space sharing. In that method, multiple problems are solved by executing them on distinct processors. This project developed a dynamic time- and space-sharing scheduler to achieve greater interactivity and throughput than could be achieved with space sharing alone. CRI and LLNL worked together on the design, testing, and review aspects of this project. There were separate software deliverables. CRI implemented a general purpose scheduling system as per the design specifications. LLNL ported the local gang scheduler software to the LLNL Cray T3D. In this approach, processors are allocated simultaneously to all components of a parallel program (in a “gang”). Program execution is preempted as needed to provide for interactivity. Programs are also relocated to different processors as needed to efficiently pack the computer’s torus of processors. In phase one, CRI developed an interface specification after discussions with LLNL for system-level software supporting a time- and space-sharing environment on the LLNL T3D. The two parties also discussed interface specifications for external control tools (such as scheduling policy tools and system administration tools) and applications programs. CRI assumed responsibility for the writing and implementation of all the necessary system software in this phase. In phase two, CRI implemented job-rolling on the Cray T3D, a mechanism for preempting a program, saving its state to disk, and later restoring its state to memory for continued execution. LLNL ported its gang scheduler to the LLNL T3D utilizing the CRI interface implemented in phases one and two. During phase three, the functionality and effectiveness of the LLNL gang scheduler was assessed to provide input to CRI time- and space-sharing efforts. CRI will utilize this information in the development of general schedulers suitable for other sites and future architectures.

  17. Representation-Independent Iteration of Sparse Data Arrays

    NASA Technical Reports Server (NTRS)

    James, Mark

    2007-01-01

    An approach is defined for iterating over massively large arrays of sparse data in a way that is independent of how the contents of the arrays are laid out in memory. What is unique and important here is the decoupling of the iteration over the sparse set of array elements from their internal memory representation, which makes the approach backward compatible with existing schemes for representing sparse arrays as well as with new ones. A functional interface is defined for implementing sparse arrays in any modern programming language, with a particular focus on the Chapel programming language. Examples are provided that show the translation of a loop computing a matrix-vector product into this representation for both the distributed and non-distributed cases. This work is directly applicable to NASA and its High Productivity Computing Systems (HPCS) program, in which JPL and our current program are engaged. The goal of that program is to create powerful, scalable, and economically viable high-performance computer systems suitable for use in national security and industry by 2010. This is important to NASA because of its computationally intensive requirements for analyzing and understanding the volumes of science data from returned missions.
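
    A minimal C++ sketch of the decoupling idea described above: iteration over the non-zero elements is expressed against an abstract visitor interface, so the same dot-product loop runs unchanged over two different storage layouts. The paper targets Chapel, and every name here is hypothetical; this illustrates representation-independent iteration only, not the paper's actual interface.

      // Representation-independent iteration over a sparse vector: a visitor
      // interface hides the storage layout, so algorithms are written once.
      // All names here are hypothetical illustrations, not the paper's API.
      #include <cstddef>
      #include <functional>
      #include <iostream>
      #include <unordered_map>
      #include <vector>

      struct SparseVector {
          virtual ~SparseVector() = default;
          // Calls visit(index, value) for every stored (non-zero) element.
          virtual void forEachNonZero(
              const std::function<void(std::size_t, double)>& visit) const = 0;
      };

      // Layout 1: parallel index/value arrays (CSR-style compression).
      struct CompressedVector : SparseVector {
          std::vector<std::size_t> idx;
          std::vector<double> val;
          void forEachNonZero(
              const std::function<void(std::size_t, double)>& visit) const override {
              for (std::size_t k = 0; k < idx.size(); ++k) visit(idx[k], val[k]);
          }
      };

      // Layout 2: hash map from index to value.
      struct HashedVector : SparseVector {
          std::unordered_map<std::size_t, double> entries;
          void forEachNonZero(
              const std::function<void(std::size_t, double)>& visit) const override {
              for (const auto& [i, v] : entries) visit(i, v);
          }
      };

      // A sparse dot product written once, against the interface only.
      double dot(const SparseVector& x, const std::vector<double>& dense) {
          double sum = 0.0;
          x.forEachNonZero([&](std::size_t i, double v) { sum += v * dense[i]; });
          return sum;
      }

      int main() {
          CompressedVector a;  a.idx = {0, 3};  a.val = {2.0, 4.0};
          HashedVector b;      b.entries = {{0, 2.0}, {3, 4.0}};
          std::vector<double> d = {1.0, 1.0, 1.0, 0.5};
          std::cout << dot(a, d) << " == " << dot(b, d) << "\n";  // 4 == 4
      }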

  18. Sleep Benefits Memory for Semantic Category Structure While Preserving Exemplar-Specific Information.

    PubMed

    Schapiro, Anna C; McDevitt, Elizabeth A; Chen, Lang; Norman, Kenneth A; Mednick, Sara C; Rogers, Timothy T

    2017-11-01

    Semantic memory encompasses knowledge about both the properties that typify concepts (e.g. robins, like all birds, have wings) and the properties that individuate conceptually related items (e.g. robins, in particular, have red breasts). We investigate the impact of sleep on new semantic learning using a property inference task in which both kinds of information are initially acquired equally well. Participants learned about three categories of novel objects possessing some properties that were shared among category exemplars and others that were unique to an exemplar, with exposure frequency varying across categories. In Experiment 1, memory for shared properties improved and memory for unique properties was preserved across a night of sleep, while memory for both feature types declined over a day awake. In Experiment 2, memory for shared properties improved across a nap, but only for the lower-frequency category, suggesting a prioritization of weakly learned information early in a sleep period. The increase was significantly correlated with amount of REM, but was also observed in participants who did not enter REM, suggesting involvement of both REM and NREM sleep. The results provide the first evidence that sleep improves memory for the shared structure of object categories, while simultaneously preserving object-unique information.

  19. The Raid distributed database system

    NASA Technical Reports Server (NTRS)

    Bhargava, Bharat; Riedl, John

    1989-01-01

    Raid, a robust and adaptable distributed database system for transaction processing (TP), is described. Raid is a message-passing system, with server processes on each site to manage concurrent processing, consistent replicated copies during site failures, and atomic distributed commitment. A high-level layered communications package provides a clean location-independent interface between servers. The latest design of the package delivers messages via shared memory in a configuration with several servers linked into a single process. Raid provides the infrastructure to investigate various methods for supporting reliable distributed TP. Measurements on TP and server CPU time are presented, along with data from experiments on communications software, consistent replicated copy control during site failures, and concurrent distributed checkpointing. A software tool for evaluating the implementation of TP algorithms in an operating-system kernel is proposed.

  20. Automatic data partitioning on distributed memory multicomputers. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Gupta, Manish

    1992-01-01

    Distributed-memory parallel computers are increasingly being used to provide high levels of performance for scientific applications. Unfortunately, such machines are not very easy to program. A number of research efforts seek to alleviate this problem by developing compilers that take over the task of generating communication. The communication overheads and the extent of parallelism exploited in the resulting target program are determined largely by the manner in which data is partitioned across different processors of the machine. Most of the compilers provide no assistance to the programmer in the crucial task of determining a good data partitioning scheme. A novel approach, the constraints-based approach, is presented to the problem of automatic data partitioning for numeric programs. In this approach, the compiler identifies some desirable requirements on the distribution of various arrays being referenced in each statement, based on performance considerations. These desirable requirements are referred to as constraints. For each constraint, the compiler determines a quality measure that captures its importance with respect to the performance of the program. The quality measure is obtained through static performance estimation, without actually generating the target data-parallel program with explicit communication. Each data distribution decision is taken by combining all the relevant constraints. The compiler attempts to resolve any conflicts between constraints such that the overall execution time of the parallel program is minimized. This approach has been implemented as part of a compiler called Paradigm, which accepts Fortran 77 programs and specifies the partitioning scheme to be used for each array in the program. We have obtained results on some programs taken from the Linpack and Eispack libraries, and the Perfect Benchmarks. These results are quite promising, and demonstrate the feasibility of automatic data partitioning for a significant class of scientific application programs with regular computations.
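
    A hedged sketch of the constraints-based idea: each array reference contributes a constraint (a preferred distribution plus a quality measure), and the compiler resolves conflicts by total quality. The enum, the scoring, and all names are hypothetical stand-ins; the real Paradigm compiler derives its quality measures from static performance estimation.

      // Toy model of constraint resolution: per array, pick the distribution
      // with the greatest summed quality. Illustrative only.
      #include <iostream>
      #include <map>
      #include <string>
      #include <vector>

      enum class Dist { Block, Cyclic };

      struct Constraint {
          std::string array;  // array the constraint applies to
          Dist preferred;     // distribution this reference favours
          double quality;     // estimated benefit (e.g., saved comm. time)
      };

      Dist resolve(const std::string& array, const std::vector<Constraint>& cs) {
          std::map<Dist, double> score;
          for (const auto& c : cs)
              if (c.array == array) score[c.preferred] += c.quality;
          // Conflicts are resolved by total quality, i.e., the estimated
          // effect on overall execution time.
          return score[Dist::Block] >= score[Dist::Cyclic] ? Dist::Block
                                                           : Dist::Cyclic;
      }

      int main() {
          std::vector<Constraint> cs = {
              {"A", Dist::Block, 5.0},   // stencil reference favours BLOCK
              {"A", Dist::Cyclic, 2.0},  // load balance favours CYCLIC
          };
          std::cout << (resolve("A", cs) == Dist::Block ? "BLOCK" : "CYCLIC")
                    << "\n";  // BLOCK wins: 5.0 > 2.0
      }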

  1. A Tutorial on Parallel and Concurrent Programming in Haskell

    NASA Astrophysics Data System (ADS)

    Peyton Jones, Simon; Singh, Satnam

    This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs, with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs, allowing programmers to use rich data types in data parallel programs that are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.

  2. Compiling global name-space programs for distributed execution

    NASA Technical Reports Server (NTRS)

    Koelbel, Charles; Mehrotra, Piyush

    1990-01-01

    Distributed memory machines do not provide hardware support for a global address space. Thus programmers are forced to partition the data across the memories of the architecture and use explicit message passing to communicate data between processors. The compiler support required to allow programmers to express their algorithms using a global name-space is examined. A general method is presented for the analysis of a high level source program and its translation to a set of independently executing tasks communicating via messages. If the compiler has enough information, this translation can be carried out at compile-time. Otherwise run-time code is generated to implement the required data movement. The analysis required in both situations is described and the performance of the generated code on the Intel iPSC/2 is presented.
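
    A sketch of the kind of code such a compiler emits, assuming a block distribution of a one-dimensional array: the global-name-space statement a(i) = b(i-1) + b(i+1) becomes a local loop preceded by an explicit halo exchange. Plain MPI is used here purely for illustration; the paper's generated code for the iPSC/2 would use that machine's message-passing primitives.

      // Compiler-style translation of a global loop to local compute + messages.
      #include <mpi.h>
      #include <cstdio>
      #include <vector>

      int main(int argc, char** argv) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const int nlocal = 4;                     // block owned by this rank
          std::vector<double> b(nlocal + 2, rank);  // +2 ghost cells
          std::vector<double> a(nlocal + 2, 0.0);

          int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
          int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

          // Generated halo exchange replacing the remote accesses b(i-1),
          // b(i+1) that cross the block boundary.
          MPI_Sendrecv(&b[nlocal], 1, MPI_DOUBLE, right, 0,
                       &b[0],      1, MPI_DOUBLE, left,  0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Sendrecv(&b[1],          1, MPI_DOUBLE, left,  1,
                       &b[nlocal + 1], 1, MPI_DOUBLE, right, 1,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          // The original global loop, restricted to locally owned indices.
          for (int i = 1; i <= nlocal; ++i)
              a[i] = b[i - 1] + b[i + 1];

          if (rank == 0) std::printf("a[1] = %g\n", a[1]);
          MPI_Finalize();
          return 0;
      }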

  3. Sharing out NASA's spoils. [economic benefits of U.S. space program

    NASA Technical Reports Server (NTRS)

    Bezdek, Roger H.; Wendling, Robert M.

    1992-01-01

    The economic benefits of NASA programs are discussed. Emphasis is given to an analysis of indirect economic benefits which estimates the effect of NASA programs on employment, personal income, corporate sales and profits, and government tax revenues in the U.S. and in each state. Data are presented that show that NASA programs have widely varying multipliers by industry and that illustrate the distribution of jobs by industry as well as the distribution of sales.

  4. 25 CFR 1000.98 - How does BIA determine a Tribe's/Consortium's share of funds to be included in an AFA?

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... be included in the AFA: (a) Formula-driven. For formula-driven programs, a Tribe's/Consortium's... applying the distribution formula to the remaining eligible funding for each program involved. (1) Distribution formulas must be reasonably related to the function or service performed by an office, and must be...

  5. 25 CFR 1000.98 - How does BIA determine a Tribe's/Consortium's share of funds to be included in an AFA?

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... be included in the AFA: (a) Formula-driven. For formula-driven programs, a Tribe's/Consortium's... applying the distribution formula to the remaining eligible funding for each program involved. (1) Distribution formulas must be reasonably related to the function or service performed by an office, and must be...

  6. 25 CFR 1000.98 - How does BIA determine a Tribe's/Consortium's share of funds to be included in an AFA?

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... be included in the AFA: (a) Formula-driven. For formula-driven programs, a Tribe's/Consortium's... applying the distribution formula to the remaining eligible funding for each program involved. (1) Distribution formulas must be reasonably related to the function or service performed by an office, and must be...

  7. 25 CFR 1000.98 - How does BIA determine a Tribe's/Consortium's share of funds to be included in an AFA?

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... be included in the AFA: (a) Formula-driven. For formula-driven programs, a Tribe's/Consortium's... applying the distribution formula to the remaining eligible funding for each program involved. (1) Distribution formulas must be reasonably related to the function or service performed by an office, and must be...

  8. 25 CFR 1000.98 - How does BIA determine a Tribe's/Consortium's share of funds to be included in an AFA?

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... be included in the AFA: (a) Formula-driven. For formula-driven programs, a Tribe's/Consortium's... applying the distribution formula to the remaining eligible funding for each program involved. (1) Distribution formulas must be reasonably related to the function or service performed by an office, and must be...

  9. Structurally Integrated Versus Structurally Segregated Memory Representations: Implications for the Design of Instructional Materials.

    ERIC Educational Resources Information Center

    Hayes-Roth, Barbara

    Two kinds of memory organization are distinguished: segregated versus integrated. In segregated memory organizations, related learned propositions have separate memory representations. In integrated memory organizations, memory representations of related propositions share common subrepresentations. Segregated memory organizations facilitate…

  10. Getting connected: Both associative and semantic links structure semantic memory for newly learned persons.

    PubMed

    Wiese, Holger; Schweinberger, Stefan R

    2015-01-01

    The present study examined whether semantic memory for newly learned people is structured by visual co-occurrence, shared semantics, or both. Participants were trained with pairs of simultaneously presented (i.e., co-occurring) preexperimentally unfamiliar faces, which either did or did not share additionally provided semantic information (occupation, place of living, etc.). Semantic information could also be shared between faces that did not co-occur. A subsequent priming experiment revealed faster responses for both co-occurrence/no shared semantics and no co-occurrence/shared semantics conditions, than for an unrelated condition. Strikingly, priming was strongest in the co-occurrence/shared semantics condition, suggesting additive effects of these factors. Additional analysis of event-related brain potentials yielded priming in the N400 component only for combined effects of visual co-occurrence and shared semantics, with more positive amplitudes in this than in the unrelated condition. Overall, these findings suggest that both semantic relatedness and visual co-occurrence are important when novel information is integrated into person-related semantic memory.

  11. Categorical and associative relations increase false memory relative to purely associative relations.

    PubMed

    Coane, Jennifer H; McBride, Dawn M; Termonen, Miia-Liisa; Cutting, J Cooper

    2016-01-01

    The goal of the present study was to examine the contributions of associative strength and similarity in terms of shared features to the production of false memories in the Deese/Roediger-McDermott list-learning paradigm. Whereas the activation/monitoring account suggests that false memories are driven by automatic associative activation from list items to nonpresented lures, combined with errors in source monitoring, other accounts (e.g., fuzzy trace theory, global-matching models) emphasize the importance of semantic-level similarity, and thus predict that shared features between list and lure items will increase false memory. Participants studied lists of nine items related to a nonpresented lure. Half of the lists consisted of items that were associated but did not share features with the lure, and the other half included items that were equally associated but also shared features with the lure (in many cases, these were taxonomically related items). The two types of lists were carefully matched in terms of a variety of lexical and semantic factors, and the same lures were used across list types. In two experiments, false recognition of the critical lures was greater following the study of lists that shared features with the critical lure, suggesting that similarity at a categorical or taxonomic level contributes to false memory above and beyond associative strength. We refer to this phenomenon as a "feature boost" that reflects additive effects of shared meaning and association strength and is generally consistent with accounts of false memory that have emphasized thematic or feature-level similarity among studied and nonstudied representations.

  12. Criminal Regulation of Anti-Forensic Tools in Japan

    NASA Astrophysics Data System (ADS)

    Ishii, Tetsuya

    This paper discusses the continuing landmark debate in a Japanese Court concerning the development and distribution of a peer-to-peer (P2P) file sharing program. The program, known as Winny, facilitates illegal activities such as piracy and the distribution of child pornography because of the encryption and anonymity afforded to users. The court has to determine whether Isamu Kaneko, the designer of Winny, is criminally liable for developing and distributing the program. This paper also assesses whether the judgment in the Winny case might set a precedent for regulating the creation and distribution of anti-forensic tools.

  13. Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Howison, Mark; Bethel, E. Wes; Childs, Hank

    2012-01-01

    With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells. The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.
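
    A minimal sketch of the hybrid pattern evaluated in the paper: MPI ranks span nodes (distributed memory) while OpenMP threads share each rank's data (shared memory), so the communication stage involves one participant per node rather than one per core. The buffer size and the reduction below are illustrative only.

      // Hybrid MPI + OpenMP: distributed memory between ranks, shared within.
      #include <mpi.h>
      #include <omp.h>
      #include <cstdio>
      #include <vector>

      int main(int argc, char** argv) {
          int provided;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          // Distributed memory: each rank holds its own slab of the volume.
          std::vector<float> slab(1 << 20, 1.0f);

          // Shared memory: cores on the node cooperate on the local slab.
          double local = 0.0;
      #pragma omp parallel for reduction(+ : local)
          for (long i = 0; i < (long)slab.size(); ++i)
              local += slab[i];

          // Communication stage: one reduction participant per node instead
          // of one per core -- fewer participants is where hybrid wins.
          double global = 0.0;
          MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0) std::printf("sum = %f\n", global);

          MPI_Finalize();
          return 0;
      }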

  14. An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations

    NASA Technical Reports Server (NTRS)

    Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

    1996-01-01

    We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.

  15. Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms

    NASA Technical Reports Server (NTRS)

    Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

    1997-01-01

    We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.

  16. Early programming and late-acting checkpoints governing the development of CD4 T cell memory.

    PubMed

    Dhume, Kunal; McKinstry, K Kai

    2018-04-27

    CD4 T cells contribute to protection against pathogens through numerous mechanisms. Incorporating the goal of memory CD4 T cell generation into vaccine strategies thus offers a powerful approach to improve their efficacy, especially in situations where humoral responses alone cannot confer long-term immunity. These threats include viruses such as influenza that mutate coat proteins to avoid neutralizing antibodies, but that are targeted by T cells that recognize more conserved protein epitopes shared by different strains. A major barrier in the design of such vaccines is that the mechanisms controlling the efficiency with which memory cells form remain incompletely understood. Here, we discuss recent insights into fate decisions controlling memory generation. We focus on the importance of three general cues: interleukin-2, antigen, and costimulatory interactions. It is increasingly clear that these signals have a powerful influence on the capacity of CD4 T cells to form memory during two distinct phases of the immune response. First, through 'programming' that occurs during initial priming, and second, through 'checkpoints' that operate later during the effector stage. These findings indicate that novel vaccine strategies must seek to optimize cognate interactions, during which interleukin-2-, antigen-, and costimulation-dependent signals are tightly linked, well beyond initial antigen encounter to induce robust memory CD4 T cells. This article is protected by copyright. All rights reserved.

  17. The attentional blink reveals serial working memory encoding: evidence from virtual and human event-related potentials.

    PubMed

    Craston, Patrick; Wyble, Brad; Chennu, Srivas; Bowman, Howard

    2009-03-01

    Observers often miss a second target (T2) if it follows an identified first target item (T1) within half a second in rapid serial visual presentation (RSVP), a finding termed the attentional blink. If two targets are presented in immediate succession, however, accuracy is excellent (Lag 1 sparing). The resource sharing hypothesis proposes a dynamic distribution of resources over a time span of up to 600 msec during the attentional blink. In contrast, the ST(2) model argues that working memory encoding is serial during the attentional blink and that, due to joint consolidation, Lag 1 is the only case where resources are shared. Experiment 1 investigates the P3 ERP component evoked by targets in RSVP. The results suggest that, in this context, P3 amplitude is an indication of bottom-up strength rather than a measure of cognitive resource allocation. Experiment 2, employing a two-target paradigm, suggests that T1 consolidation is not affected by the presentation of T2 during the attentional blink. However, if targets are presented in immediate succession (Lag 1 sparing), they are jointly encoded into working memory. We use the ST(2) model's neural network implementation, which replicates a range of behavioral results related to the attentional blink, to generate "virtual ERPs" by summing across activation traces. We compare virtual to human ERPs and show how the results suggest a serial nature of working memory encoding as implied by the ST(2) model.

  18. The MOLDY short-range molecular dynamics package

    NASA Astrophysics Data System (ADS)

    Ackland, G. J.; D'Mellow, K.; Daraszewicz, S. L.; Hepburn, D. J.; Uhrin, M.; Stratford, K.

    2011-12-01

    We describe a parallelised version of the MOLDY molecular dynamics program. This Fortran code is aimed at systems which may be described by short-range potentials and specifically those which may be addressed with the embedded atom method. This includes a wide range of transition metals and alloys. MOLDY provides a range of options in terms of the molecular dynamics ensemble used and the boundary conditions which may be applied. A number of standard potentials are provided, and the modular structure of the code allows new potentials to be added easily. The code is parallelised using OpenMP and can therefore be run on shared memory systems, including modern multicore processors. Particular attention is paid to the updates required in the main force loop, where synchronisation is often required in OpenMP implementations of molecular dynamics. We examine the performance of the parallel code in detail and give some examples of applications to realistic problems, including the dynamic compression of copper and carbon migration in an iron-carbon alloy. Program summary: Program title: MOLDY. Catalogue identifier: AEJU_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEJU_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: GNU General Public License version 2. No. of lines in distributed program, including test data, etc.: 382 881. No. of bytes in distributed program, including test data, etc.: 6 705 242. Distribution format: tar.gz. Programming language: Fortran 95/OpenMP. Computer: Any. Operating system: Any. Has the code been vectorised or parallelized?: Yes, OpenMP is required for parallel execution. RAM: 100 MB or more. Classification: 7.7. Nature of problem: MOLDY addresses the problem of many atoms (of order 10^6) interacting via a classical interatomic potential on a timescale of microseconds. It is designed for problems where statistics must be gathered over a number of equivalent runs, such as measuring thermodynamic properties, diffusion, radiation damage, fracture, twinning deformation, nucleation and growth of phase transitions, sputtering, etc. In the vast majority of materials, the interactions are non-pairwise, and the code must be able to deal with many-body forces. Solution method: Molecular dynamics involves integrating Newton's equations of motion. MOLDY uses Verlet (for good energy conservation) or predictor-corrector (for accurate trajectories) algorithms. It is parallelised using OpenMP. It also includes a static minimisation routine to find the lowest energy structure. Boundary conditions for surfaces, clusters, grain boundaries, thermostat (Nosé), barostat (Parrinello-Rahman), and externally applied strain are provided. The initial configuration can be either a repeated unit cell or have all atoms given explicitly. Initial velocities are generated internally, but it is also possible to specify the velocity of a particular atom. A wide range of interatomic force models are implemented, including embedded atom, Morse or Lennard-Jones. Thus the program is especially well suited to calculations of metals. Restrictions: The code is designed for short-ranged potentials, and there is no Ewald sum. Thus for long range interactions where all particles interact with all others, the order-N scaling will fail. Different interatomic potential forms require recompilation of the code. Additional comments: There is a set of associated open-source analysis software for postprocessing and visualisation. This includes local crystal structure recognition and identification of topological defects. Running time: A set of test modules for running time are provided. The code scales as order N. The parallelisation shows near-linear scaling with number of processors in a shared memory environment. A typical run of a few tens of nanometers for a few nanoseconds will run on a timescale of days on a multiprocessor desktop.
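
    A small C++/OpenMP sketch of the synchronisation issue mentioned for the main force loop: with Newton's third law, two threads may accumulate into the same atom's force, so the updates must be made race-free, here with atomic updates. MOLDY itself is Fortran 95/OpenMP and may organise the loop differently; the pair force and one-dimensional geometry below are stand-ins.

      // Race-free force accumulation in an OpenMP pair loop.
      #include <omp.h>
      #include <cstdio>
      #include <vector>

      struct Vec3 { double x, y, z; };

      int main() {
          const int n = 1000;
          std::vector<Vec3> pos(n), f(n, {0, 0, 0});
          for (int i = 0; i < n; ++i) pos[i] = {i * 0.1, 0.0, 0.0};

      #pragma omp parallel for schedule(dynamic)
          for (int i = 0; i < n; ++i) {
              for (int j = i + 1; j < n; ++j) {
                  double dx = pos[j].x - pos[i].x;
                  double r2 = dx * dx + 1e-12;
                  double fij = 1.0 / r2;     // stand-in pair force, x only
                  // With Newton's third law both f[i] and f[j] are updated,
                  // and several threads may touch the same atom at once.
      #pragma omp atomic
                  f[i].x -= fij * dx;
      #pragma omp atomic
                  f[j].x += fij * dx;
              }
          }
          std::printf("f[0].x = %g\n", f[0].x);
          return 0;
      }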

  19. Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

    NASA Astrophysics Data System (ADS)

    Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

    2015-06-01

    The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, in the CPU code version, the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX) have been combined with shared memory approaches that take advantage of multi-core platforms. On the other hand, the second implementation, called the multi-GPU code version, is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions, including shared memory approaches, vector instructions and multi-processors (both CPU and GPU), and compares them in order to delimit the degree of improvement achieved by distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed, and it is demonstrated that adding shared memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that make efficient use of the cache memory of CPUs. In this case, GPU computing is slightly more than twice as fast as the fine-tuned CPU version for both one and two nodes. However, for massive computations explicit vector instructions are not worthwhile, since memory bandwidth is the limiting factor and performance tends to be the same as that of the sequential version with auto-vectorisation and a shared memory approach. In this scenario, GPU computing is the best option since it provides homogeneous behaviour. More specifically, the speedup of GPU computing reaches an upper limit of 12 for both one and two GPUs, whereas performance reaches peak values of 80 GFlops and 146 GFlops for one GPU and two GPUs, respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in this type of application.
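
    A hedged sketch of the shared-memory approach benchmarked above: a bi-dimensional FDTD-style stencil update parallelised across rows with OpenMP, with a unit-stride inner loop that the compiler can auto-vectorise (the SSE/AVX path). The grid size, the coefficient, and the scalar wave-equation stand-in are illustrative; the paper's vibroacoustic scheme updates coupled velocity and stress fields.

      // 2-D scalar wave FDTD step, row-parallel with OpenMP.
      #include <omp.h>
      #include <cstdio>
      #include <vector>

      int main() {
          const int nx = 512, ny = 512;
          std::vector<float> u_prev(nx * ny, 0.0f), u(nx * ny, 0.0f),
                             u_next(nx * ny, 0.0f);
          u[(ny / 2) * nx + nx / 2] = 1.0f;   // point excitation
          const float c2 = 0.25f;             // (c*dt/dx)^2, inside stability limit

          for (int step = 0; step < 100; ++step) {
              // Rows are independent within a step: no thread synchronisation
              // is needed, and the inner loop is unit-stride (vectorisable).
      #pragma omp parallel for
              for (int j = 1; j < ny - 1; ++j)
                  for (int i = 1; i < nx - 1; ++i) {
                      int k = j * nx + i;
                      float lap = u[k - 1] + u[k + 1] + u[k - nx] + u[k + nx]
                                  - 4.0f * u[k];
                      u_next[k] = 2.0f * u[k] - u_prev[k] + c2 * lap;
                  }
              u_prev.swap(u);   // rotate time levels
              u.swap(u_next);
          }
          std::printf("u(center) = %g\n", u[(ny / 2) * nx + nx / 2]);
          return 0;
      }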

  20. The Locus Preservation Hypothesis: Shared Linguistic Profiles across Developmental Disorders and the Resilient Part of the Human Language Faculty

    PubMed Central

    Leivada, Evelina; Kambanaros, Maria; Grohmann, Kleanthes K.

    2017-01-01

    Grammatical markers are not uniformly impaired across speakers of different languages, even when speakers share a diagnosis and the marker in question is grammaticalized in a similar way in these languages. The aim of this work is to demarcate, from a cross-linguistic perspective, the linguistic phenotype of three genetically heterogeneous developmental disorders: specific language impairment, Down syndrome, and autism spectrum disorder. After a systematic review of linguistic profiles targeting mainly English-, Greek-, Catalan-, and Spanish-speaking populations with developmental disorders (n = 880), shared loci of impairment are identified and certain domains of grammar are shown to be more vulnerable than others. The distribution of impaired loci is captured by the Locus Preservation Hypothesis which suggests that specific parts of the language faculty are immune to impairment across developmental disorders. Through the Locus Preservation Hypothesis, a classical chicken and egg question can be addressed: Do poor conceptual resources and memory limitations result in an atypical grammar or does a grammatical breakdown lead to conceptual and memory limitations? Overall, certain morphological markers reveal themselves as highly susceptible to impairment, while syntactic operations are preserved, granting support to the first scenario. The origin of resilient syntax is explained from a phylogenetic perspective in connection to the “syntax-before-phonology” hypothesis. PMID:29081756

  1. Shared processing in multiple object tracking and visual working memory in the absence of response order and task order confounds

    PubMed Central

    Howe, Piers D. L.

    2017-01-01

    To understand how the visual system represents multiple moving objects and how those representations contribute to tracking, it is essential that we understand how the processes of attention and working memory interact. In the work described here we present an investigation of that interaction via a series of tracking and working memory dual-task experiments. Previously, it has been argued that tracking is resistant to disruption by a concurrent working memory task and that any apparent disruption is in fact due to observers making a response to the working memory task, rather than due to competition for shared resources. Contrary to this, in our experiments we find that when task order and response order confounds are avoided, all participants show a similar decrease in both tracking and working memory performance. However, if task and response order confounds are not adequately controlled for we find substantial individual differences, which could explain the previous conflicting reports on this topic. Our results provide clear evidence that tracking and working memory tasks share processing resources. PMID:28410383

  2. Shared processing in multiple object tracking and visual working memory in the absence of response order and task order confounds.

    PubMed

    Lapierre, Mark D; Cropper, Simon J; Howe, Piers D L

    2017-01-01

    To understand how the visual system represents multiple moving objects and how those representations contribute to tracking, it is essential that we understand how the processes of attention and working memory interact. In the work described here we present an investigation of that interaction via a series of tracking and working memory dual-task experiments. Previously, it has been argued that tracking is resistant to disruption by a concurrent working memory task and that any apparent disruption is in fact due to observers making a response to the working memory task, rather than due to competition for shared resources. Contrary to this, in our experiments we find that when task order and response order confounds are avoided, all participants show a similar decrease in both tracking and working memory performance. However, if task and response order confounds are not adequately controlled for we find substantial individual differences, which could explain the previous conflicting reports on this topic. Our results provide clear evidence that tracking and working memory tasks share processing resources.

  3. Visual and spatial working memory are not that dissociated after all: a time-based resource-sharing account.

    PubMed

    Vergauwe, Evie; Barrouillet, Pierre; Camos, Valérie

    2009-07-01

    Examinations of interference between visual and spatial materials in working memory have suggested domain- and process-based fractionations of visuo-spatial working memory. The present study examined the role of central time-based resource sharing in visuo-spatial working memory and assessed its role in obtained interference patterns. Visual and spatial storage were combined with both visual and spatial on-line processing components in computer-paced working memory span tasks (Experiment 1) and in a selective interference paradigm (Experiment 2). The cognitive load of the processing components was manipulated to investigate its impact on concurrent maintenance for both within-domain and between-domain combinations of processing and storage components. In contrast to both domain- and process-based fractionations of visuo-spatial working memory, the results revealed that recall performance was determined by the cognitive load induced by the processing of items, rather than by the domain to which those items pertained. These findings are interpreted as evidence for a time-based resource-sharing mechanism in visuo-spatial working memory.

  4. Discrim: a computer program using an interactive approach to dissect a mixture of normal or lognormal distributions

    USGS Publications Warehouse

    Bridges, N.J.; McCammon, R.B.

    1980-01-01

    DISCRIM is an interactive computer graphics program that dissects mixtures of normal or lognormal distributions. The program was written in an effort to obtain a more satisfactory solution to the dissection problem than that offered by a graphical or numerical approach alone. It combines graphic and analytic techniques using a Tektronix terminal in a time-share computing environment. The main program and subroutines were written in the FORTRAN language. © 1980.
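
    One standard numerical route to the dissection problem DISCRIM addresses is expectation-maximisation for a mixture of normals, sketched below in C++ for two components. This is illustrative only: DISCRIM is an interactive FORTRAN graphics program from 1980, and its actual fitting procedure may differ.

      // EM for a two-component normal mixture (illustrative sketch).
      #include <cmath>
      #include <cstdio>
      #include <vector>

      const double PI = 3.141592653589793;

      double normal_pdf(double x, double mu, double sd) {
          double z = (x - mu) / sd;
          return std::exp(-0.5 * z * z) / (sd * std::sqrt(2.0 * PI));
      }

      int main() {
          // Two overlapping populations (e.g., log-transformed assay data).
          std::vector<double> x = {1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1};
          double w = 0.5, mu1 = 0.5, mu2 = 2.5, sd1 = 1.0, sd2 = 1.0;

          for (int iter = 0; iter < 200; ++iter) {
              double sw = 0, s1 = 0, s2 = 0, v1 = 0, v2 = 0;
              std::vector<double> r(x.size());
              for (size_t i = 0; i < x.size(); ++i) {  // E-step: responsibilities
                  double p1 = w * normal_pdf(x[i], mu1, sd1);
                  double p2 = (1 - w) * normal_pdf(x[i], mu2, sd2);
                  r[i] = p1 / (p1 + p2);
                  sw += r[i];
                  s1 += r[i] * x[i];
                  s2 += (1 - r[i]) * x[i];
              }
              mu1 = s1 / sw;                           // M-step: update parameters
              mu2 = s2 / (x.size() - sw);
              for (size_t i = 0; i < x.size(); ++i) {
                  v1 += r[i] * (x[i] - mu1) * (x[i] - mu1);
                  v2 += (1 - r[i]) * (x[i] - mu2) * (x[i] - mu2);
              }
              sd1 = std::sqrt(v1 / sw);
              sd2 = std::sqrt(v2 / (x.size() - sw));
              w = sw / x.size();
          }
          std::printf("w=%.2f mu1=%.2f mu2=%.2f\n", w, mu1, mu2);
          return 0;
      }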

  5. CRITIC2: A program for real-space analysis of quantum chemical interactions in solids

    NASA Astrophysics Data System (ADS)

    Otero-de-la-Roza, A.; Johnson, Erin R.; Luaña, Víctor

    2014-03-01

    We present CRITIC2, a program for the analysis of quantum-mechanical atomic and molecular interactions in periodic solids. This code, a greatly improved version of the previous CRITIC program (Otero-de-la Roza et al., 2009), can: (i) find critical points of the electron density and related scalar fields such as the electron localization function (ELF), Laplacian, … (ii) integrate atomic properties in the framework of Bader’s Atoms-in-Molecules theory (QTAIM), (iii) visualize non-covalent interactions in crystals using the non-covalent interactions (NCI) index, (iv) generate relevant graphical representations including lines, planes, gradient paths, contour plots, atomic basins, … and (v) perform transformations between file formats describing scalar fields and crystal structures. CRITIC2 can interface with the output produced by a variety of electronic structure programs including WIEN2k, elk, PI, abinit, Quantum ESPRESSO, VASP, Gaussian, and, in general, any other code capable of writing the scalar field under study to a three-dimensional grid. CRITIC2 is parallelized, completely documented (including illustrative test cases) and publicly available under the GNU General Public License. Catalogue identifier: AECB_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AECB_v2_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: yes No. of lines in distributed program, including test data, etc.: 11686949 No. of bytes in distributed program, including test data, etc.: 337020731 Distribution format: tar.gz Programming language: Fortran 77 and 90. Computer: Workstations. Operating system: Unix, GNU/Linux. Has the code been vectorized or parallelized?: Shared-memory parallelization can be used for most tasks. Classification: 7.3. Catalogue identifier of previous version: AECB_v1_0 Journal reference of previous version: Comput. Phys. Comm. 180 (2009) 157 Nature of problem: Analysis of quantum-chemical interactions in periodic solids by means of atoms-in-molecules and related formalisms. Solution method: Critical point search using Newton’s algorithm, atomic basin integration using bisection, qtree and grid-based algorithms, diverse graphical representations and computation of the non-covalent interactions index on a three-dimensional grid. Additional comments: !!!!! The distribution file for this program is over 330 Mbytes and therefore is not delivered directly when download or Email is requested. Instead a html file giving details of how the program can be obtained is sent. !!!!! Running time: Variable, depending on the crystal and the source of the underlying scalar field.
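
    A small sketch of the first task listed above, finding critical points with Newton's method: solve H·dx = -g (H the Hessian, g the gradient) and step until the gradient vanishes. The two-dimensional analytic field, the finite-difference derivatives, and all names are illustrative stand-ins for CRITIC2's three-dimensional machinery on real electron densities.

      // Newton search for a critical point (grad f = 0) of a toy 2-D field.
      #include <cmath>
      #include <cstdio>

      double f(double x, double y) {             // stand-in scalar field
          return std::exp(-(x - 1) * (x - 1) - y * y);
      }

      void grad(double x, double y, double g[2], double h = 1e-5) {
          g[0] = (f(x + h, y) - f(x - h, y)) / (2 * h);
          g[1] = (f(x, y + h) - f(x, y - h)) / (2 * h);
      }

      int main() {
          double x = 0.6, y = 0.3;                // starting guess
          for (int it = 0; it < 50; ++it) {
              double g[2], gp[2], gm[2], H[2][2];
              const double dh = 1e-4;
              grad(x, y, g);
              // Numerical Hessian by differencing the gradient.
              grad(x + dh, y, gp); grad(x - dh, y, gm);
              H[0][0] = (gp[0] - gm[0]) / (2 * dh);
              H[1][0] = (gp[1] - gm[1]) / (2 * dh);
              grad(x, y + dh, gp); grad(x, y - dh, gm);
              H[0][1] = (gp[0] - gm[0]) / (2 * dh);
              H[1][1] = (gp[1] - gm[1]) / (2 * dh);
              // Solve H * (dx, dy) = -g by Cramer's rule and step.
              double det = H[0][0] * H[1][1] - H[0][1] * H[1][0];
              double dx = (-g[0] * H[1][1] + g[1] * H[0][1]) / det;
              double dy = (-H[0][0] * g[1] + H[1][0] * g[0]) / det;
              x += dx; y += dy;
              if (std::fabs(dx) + std::fabs(dy) < 1e-10) break;
          }
          std::printf("critical point near (%.6f, %.6f)\n", x, y);  // (1, 0)
          return 0;
      }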

  6. [Neuroscience and collective memory: memory schemas linking brain, societies and cultures].

    PubMed

    Legrand, Nicolas; Gagnepain, Pierre; Peschanski, Denis; Eustache, Francis

    2015-01-01

    During the last two decades, the effect of intersubjective relationships on cognition has been an emerging topic in cognitive neurosciences, leading through a so-called "social turn" to the formation of new domains integrating society and cultures into this research area. Such inquiry has recently been extended to collective memory studies. Collective memory refers to shared representations that are constitutive of the identity of a group and distributed among all its members connected by a common history. After briefly describing those evolutions in the study of human brain and behaviors, we review recent research that has brought together cognitive psychology, neuroscience and social sciences into collective memory studies. Using the reemerging concept of memory schema, we propose a theoretical framework that accounts for the formation of collective memories, with a specific focus on the encoding process of historical events. We suggest that (1) although the concept of schema has mainly been used to describe rather passive frameworks of knowledge, such structures may also be involved more actively in the understanding of significant collective events; and (2) although some schema research has restricted itself to the individual level of inquiry, we describe a strong coherence between memory and cultural frameworks. Integrating the neural basis and properties of memory schema into collective memory studies may pave the way toward a better understanding of the reciprocal interaction between individual memories and cultural resources such as media or education. © Société de Biologie, 2016.

  7. Automatic Data Partitioning on Distributed Memory Multiprocessors

    DTIC Science & Technology

    1990-10-01

    Gupta, Manish; Banerjee, Prithviraj — Coordinated Science Laboratory. [The scanned abstract is too garbled by OCR to reconstruct; the recoverable text notes that the techniques developed for the partitioning of arrays can also be applied to other programming languages, such as C.]

  8. Why are you telling me that? A conceptual model of the social function of autobiographical memory.

    PubMed

    Alea, Nicole; Bluck, Susan

    2003-03-01

    In an effort to stimulate and guide empirical work within a functional framework, this paper provides a conceptual model of the social functions of autobiographical memory (AM) across the lifespan. The model delineates the processes and variables involved when AMs are shared to serve social functions. Components of the model include: lifespan contextual influences, the qualitative characteristics of memory (emotionality and level of detail recalled), the speaker's characteristics (age, gender, and personality), the familiarity and similarity of the listener to the speaker, the level of responsiveness during the memory-sharing process, and the nature of the social relationship in which the memory sharing occurs (valence and length of the relationship). These components are shown to influence the type of social function served and/or, the extent to which social functions are served. Directions for future empirical work to substantiate the model and hypotheses derived from the model are provided.

  9. Implementation and performance of FDPS: a framework for developing parallel particle simulation codes

    NASA Astrophysics Data System (ADS)

    Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro

    2016-08-01

    We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N^2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 10^7) to 300 ms (N = 10^9). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks.
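
    For scale, here is a self-contained toy of the kind of user code FDPS abstracts: a direct O(N^2) gravitational force loop plus a leapfrog integrator, in C++ (FDPS's host language). The framework would supply domain decomposition, particle exchange, and tree-based force evaluation around a user-defined particle type; none of that is shown here, and all names are illustrative.

      // Direct-summation N-body step with kick-drift-kick leapfrog (G = 1).
      #include <cmath>
      #include <cstdio>
      #include <vector>

      struct Particle { double x, y, z, vx, vy, vz, m, ax, ay, az; };

      void forces(std::vector<Particle>& p, double eps2 = 1e-6) {
          for (auto& pi : p) pi.ax = pi.ay = pi.az = 0.0;
          for (size_t i = 0; i < p.size(); ++i)
              for (size_t j = 0; j < p.size(); ++j) {
                  if (i == j) continue;
                  double dx = p[j].x - p[i].x, dy = p[j].y - p[i].y,
                         dz = p[j].z - p[i].z;
                  double r2 = dx * dx + dy * dy + dz * dz + eps2;  // softened
                  double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
                  p[i].ax += p[j].m * dx * inv_r3;
                  p[i].ay += p[j].m * dy * inv_r3;
                  p[i].az += p[j].m * dz * inv_r3;
              }
      }

      int main() {
          std::vector<Particle> p = {
              {0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0},    // "star"
              {1, 0, 0, 0, 1, 0, 1e-3, 0, 0, 0},   // "planet", circular orbit
          };
          const double dt = 1e-3;
          forces(p);
          for (int step = 0; step < 1000; ++step) {
              for (auto& q : p) { q.vx += 0.5*dt*q.ax; q.vy += 0.5*dt*q.ay; q.vz += 0.5*dt*q.az; }
              for (auto& q : p) { q.x += dt*q.vx; q.y += dt*q.vy; q.z += dt*q.vz; }
              forces(p);
              for (auto& q : p) { q.vx += 0.5*dt*q.ax; q.vy += 0.5*dt*q.ay; q.vz += 0.5*dt*q.az; }
          }
          // After t = 1 the planet should be near (cos 1, sin 1).
          std::printf("planet at (%.3f, %.3f)\n", p[1].x, p[1].y);
          return 0;
      }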

  10. C++QEDv2 Milestone 10: A C++/Python application-programming framework for simulating open quantum dynamics

    NASA Astrophysics Data System (ADS)

    Sandner, Raimar; Vukics, András

    2014-09-01

    The v2 Milestone 10 release of C++QED is primarily a feature release, which also corrects some problems of the previous release, especially as regards the build system. The adoption of C++11 features has led to many simplifications in the codebase. A full doxygen-based API manual [1] is now provided together with updated user guides. A largely automated, versatile new testsuite directed both towards computational and physics features allows for quickly spotting arising errors. The states of trajectories are now savable and recoverable with full binary precision, allowing for trajectory continuation regardless of evolution method (single/ensemble Monte Carlo wave-function or Master equation trajectory). As the main new feature, the framework now presents Python bindings to the highest-level programming interface, so that actual simulations for given composite quantum systems can now be performed from Python. Catalogue identifier: AELU_v2_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AELU_v2_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: yes. No. of lines in distributed program, including test data, etc.: 492422. No. of bytes in distributed program, including test data, etc.: 8070987. Distribution format: tar.gz. Programming language: C++/Python. Computer: i386-i686, x86_64. Operating system: In principle cross-platform, as yet tested only on UNIX-like systems (including Mac OS X). RAM: The framework itself takes about 60 MB, which is fully shared. The additional memory taken by the program which defines the actual physical system (script) is typically less than 1 MB. The memory storing the actual data scales with the system dimension for state-vector manipulations, and the square of the dimension for density-operator manipulations. This might easily be GBs, and often the memory of the machine limits the size of the simulated system. Classification: 4.3, 4.13, 6.2. External routines: Boost C++ libraries, GNU Scientific Library, Blitz++, FLENS, NumPy, SciPy. Catalogue identifier of previous version: AELU_v1_0. Journal reference of previous version: Comput. Phys. Comm. 183 (2012) 1381. Does the new version supersede the previous version?: Yes. Nature of problem: Definition of (open) composite quantum systems out of elementary building blocks [2,3]. Manipulation of such systems, with emphasis on dynamical simulations such as Master-equation evolution [4] and Monte Carlo wave-function simulation [5]. Solution method: Master equation, Monte Carlo wave-function method. Reasons for new version: The new version is mainly a feature release, but it does correct some problems of the previous version, especially as regards the build system. Summary of revisions: We give an example of a typical Python script implementing the ring-cavity system presented in Sec. 3.3 of Ref. [2]. Restrictions: Total dimensionality of the system. Master equation: a few thousand. Monte Carlo wave-function trajectory: several million. Unusual features: Because of the heavy use of compile-time algorithms, compilation of programs written in the framework may take a long time and much memory (up to several GBs). Additional comments: The framework is not a program, but provides and implements an application-programming interface for developing simulations in the indicated problem domain. We use several C++11 features, which limits the range of supported compilers (g++ 4.7, clang++ 3.1). Documentation: http://cppqed.sourceforge.net/. Running time: Depending on the magnitude of the problem, can vary from a few seconds to weeks. References: [1] Entry point: http://cppqed.sf.net [2] A. Vukics, C++QEDv2: The multi-array concept and compile-time algorithms in the definition of composite quantum systems, Comput. Phys. Comm. 183 (2012) 1381. [3] A. Vukics, H. Ritsch, C++QED: an object-oriented framework for wave-function simulations of cavity QED systems, Eur. Phys. J. D 44 (2007) 585. [4] H. J. Carmichael, An Open Systems Approach to Quantum Optics, Springer, 1993. [5] J. Dalibard, Y. Castin, K. Molmer, Wave-function approach to dissipative processes in quantum optics, Phys. Rev. Lett. 68 (1992) 580.

  11. Linking agent-based models and stochastic models of financial markets

    PubMed Central

    Feng, Ling; Li, Baowen; Podobnik, Boris; Preis, Tobias; Stanley, H. Eugene

    2012-01-01

    It is well-known that financial asset returns exhibit fat-tailed distributions and long-term memory. These empirical features are the main objectives of modeling efforts using (i) stochastic processes to quantitatively reproduce these features and (ii) agent-based simulations to understand the underlying microscopic interactions. After reviewing selected empirical and theoretical evidence documenting the behavior of traders, we construct an agent-based model to quantitatively demonstrate that “fat” tails in return distributions arise when traders share similar technical trading strategies and decisions. Extending our behavioral model to a stochastic model, we derive and explain a set of quantitative scaling relations of long-term memory from the empirical behavior of individual market participants. Our analysis provides a behavioral interpretation of the long-term memory of absolute and squared price returns: They are directly linked to the way investors evaluate their investments by applying technical strategies at different investment horizons, and this quantitative relationship is in agreement with empirical findings. Our approach provides a possible behavioral explanation for stochastic models for financial systems in general and provides a method to parameterize such models from market data rather than from statistical fitting. PMID:22586086

  12. Linking agent-based models and stochastic models of financial markets.

    PubMed

    Feng, Ling; Li, Baowen; Podobnik, Boris; Preis, Tobias; Stanley, H Eugene

    2012-05-29

    It is well-known that financial asset returns exhibit fat-tailed distributions and long-term memory. These empirical features are the main objectives of modeling efforts using (i) stochastic processes to quantitatively reproduce these features and (ii) agent-based simulations to understand the underlying microscopic interactions. After reviewing selected empirical and theoretical evidence documenting the behavior of traders, we construct an agent-based model to quantitatively demonstrate that "fat" tails in return distributions arise when traders share similar technical trading strategies and decisions. Extending our behavioral model to a stochastic model, we derive and explain a set of quantitative scaling relations of long-term memory from the empirical behavior of individual market participants. Our analysis provides a behavioral interpretation of the long-term memory of absolute and squared price returns: They are directly linked to the way investors evaluate their investments by applying technical strategies at different investment horizons, and this quantitative relationship is in agreement with empirical findings. Our approach provides a possible behavioral explanation for stochastic models for financial systems in general and provides a method to parameterize such models from market data rather than from statistical fitting.

  13. LDRD final report on massively-parallel linear programming : the parPCx system.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Parekh, Ojas; Phillips, Cynthia Ann; Boman, Erik Gunnar

    2005-02-01

    This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runs on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Integer and Combinatorial Optimizer). We conclude with directions for long-term future algorithmic research and for near-term development that could improve the performance of parPCx.
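
    As a hedged illustration of the computational core described above (not parPCx code; the matrix sizes and names are invented), the sketch below assembles the sparse normal-equations matrix of an interior-point iteration and solves it both directly and iteratively, mirroring the direct/iterative solver choice the report discusses:

      # Hypothetical sketch: the core linear solve of an interior-point LP step.
      import numpy as np
      import scipy.sparse as sp
      import scipy.sparse.linalg as spla

      rng = np.random.default_rng(0)
      m, n = 50, 200                       # constraints x variables (n >> m)
      A = sp.random(m, n, density=0.05, random_state=0, format="csr")
      x = rng.uniform(0.5, 2.0, n)         # current primal iterate (x > 0)
      s = rng.uniform(0.5, 2.0, n)         # current dual slacks (s > 0)
      r = rng.standard_normal(m)           # right-hand side for this iteration

      D2 = sp.diags(x / s)                 # scaling matrix D^2 = X S^{-1}
      # Small regularization, as practical codes add, guards against rank deficiency.
      M = (A @ D2 @ A.T + 1e-8 * sp.identity(m)).tocsc()

      dy_direct = spla.splu(M).solve(r)    # direct: sparse LU factorization
      dy_cg, info = spla.cg(M, r)          # iterative: CG (M is SPD here)
      assert info == 0, "CG did not converge"
      print(np.linalg.norm(dy_direct - dy_cg))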

  14. Destination memory impairment in older people.

    PubMed

    Gopie, Nigel; Craik, Fergus I M; Hasher, Lynn

    2010-12-01

    Older adults are assumed to have poor destination memory-knowing to whom they tell particular information-and anecdotes about them repeating stories to the same people are cited as informal evidence for this claim. Experiment 1 assessed young and older adults' destination memory by having participants tell facts (e.g., "A dime has 118 ridges around its edge") to pictures of famous people (e.g., Oprah Winfrey). Surprise recognition memory tests, which also assessed confidence, revealed that older adults, compared to young adults, were disproportionately impaired on destination memory relative to spared memory for the individual components (i.e., facts, faces) of the episode. Older adults also were more confident that they had not told a fact to a particular person when they actually had (i.e., a miss); this presumably causes them to repeat information more often than young adults. When the direction of information transfer was reversed in Experiment 2, such that the famous people shared information with the participants (i.e., a source memory experiment), age-related memory differences disappeared. In contrast to the destination memory experiment, older adults in the source memory experiment were more confident than young adults that someone had shared a fact with them when a different person actually had shared the fact (i.e., a false alarm). Overall, accuracy and confidence jointly influence age-related changes to destination memory, a fundamental component of successful communication. (c) 2010 APA, all rights reserved.

  15. Integrated Network Decompositions and Dynamic Programming for Graph Optimization (INDDGO)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    The INDDGO software package offers a set of tools for finding exact solutions to graph optimization problems via tree decompositions and dynamic programming algorithms. Currently the framework offers serial and parallel (distributed memory) algorithms for finding tree decompositions and solving the maximum weighted independent set problem. The parallel dynamic programming algorithm is implemented on top of the MADNESS task-based runtime.
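
    A minimal sketch of the dynamic-programming idea (not INDDGO code): the two-state recurrence for the maximum weighted independent set, shown in its simplest setting, where the graph is itself a rooted tree rather than a general graph with a tree decomposition:

      # Hypothetical illustration: max weighted independent set on a rooted tree.
      def mwis_tree(adj, weight, root=0):
          """adj: {node: [children, ...]}; weight: {node: w}; returns best total."""
          take, skip = {}, {}           # best value with node in / out of the set
          order, stack = [], [root]
          while stack:                  # iterative DFS for a post-order schedule
              v = stack.pop()
              order.append(v)
              stack.extend(adj.get(v, []))
          for v in reversed(order):     # children are always solved before parents
              kids = adj.get(v, [])
              take[v] = weight[v] + sum(skip[c] for c in kids)
              skip[v] = sum(max(take[c], skip[c]) for c in kids)
          return max(take[root], skip[root])

      adj = {0: [1, 2], 1: [3, 4], 2: [5]}
      weight = {0: 4, 1: 3, 2: 5, 3: 2, 4: 1, 5: 2}
      print(mwis_tree(adj, weight))     # 9, e.g. the set {0, 3, 4, 5}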

  16. MPI, HPF or OpenMP: A Study with the NAS Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Hribar, Michelle; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1999-01-01

    Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but the task can be simplified by high-level languages and, better still, automated by parallelizing tools and compilers. The definition of the HPF (High Performance Fortran, based on the data-parallel model) and OpenMP (based on the shared-memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to languages like Fortran and simplify many tedious tasks encountered in writing message-passing programs. In our study we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. A comparison of their performance with the MPI implementation, and the pros and cons of the different approaches, will be discussed, along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, the potential of applying some of the techniques to realistic aerospace applications will be presented.
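
    A hedged sketch of the two models under comparison, transposed to Python's standard library (the workload and all names are invented): threads updating one shared result array stand in for the OpenMP shared-memory model, while processes returning partial results over pipes stand in for MPI-style message passing:

      import threading
      import multiprocessing as mp

      N, W = 1_000_000, 4
      BOUNDS = [(i * N // W, (i + 1) * N // W) for i in range(W)]

      def partial_sum(lo, hi):
          return sum(x * x for x in range(lo, hi))

      def shared_worker(slot, lo, hi, out):
          out[slot] = partial_sum(lo, hi)   # writes directly into shared memory

      def mp_worker(conn, lo, hi):
          conn.send(partial_sum(lo, hi))    # explicit message back to the master
          conn.close()

      if __name__ == "__main__":
          # OpenMP-style: one address space, no explicit communication.
          out = [0] * W
          ts = [threading.Thread(target=shared_worker, args=(i, *BOUNDS[i], out))
                for i in range(W)]
          for t in ts: t.start()
          for t in ts: t.join()
          print("shared memory:", sum(out))

          # MPI-style: private address spaces, explicit sends and receives.
          pipes, procs = [], []
          for lo, hi in BOUNDS:
              parent, child = mp.Pipe()
              p = mp.Process(target=mp_worker, args=(child, lo, hi))
              p.start()
              pipes.append(parent)
              procs.append(p)
          print("message passing:", sum(c.recv() for c in pipes))
          for p in procs: p.join()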

  17. Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications

    NASA Technical Reports Server (NTRS)

    OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)

    1998-01-01

    This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).

  18. MPI, HPF or OpenMP: A Study with the NAS Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Hribar, M.; Waheed, A.; Yan, J.; Saini, Subhash (Technical Monitor)

    1999-01-01

    Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but this task can be simplified by high-level languages and, better still, automated by parallelizing tools and compilers. The definition of the HPF (High Performance Fortran, based on the data-parallel model) and OpenMP (based on the shared-memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to languages like Fortran and simplify many tedious tasks encountered in writing message-passing programs. In our study, we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. A comparison of their performance with the MPI implementation, and the pros and cons of the different approaches, will be discussed, along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, the potential of applying some of the techniques to realistic aerospace applications will be presented.

  19. Ambipolar organic thin-film transistor-based nano-floating-gate nonvolatile memory

    NASA Astrophysics Data System (ADS)

    Han, Jinhua; Wang, Wei; Ying, Jun; Xie, Wenfa

    2014-01-01

    An ambipolar organic thin-film transistor-based nano-floating-gate nonvolatile memory was demonstrated, with discretely distributed gold nanoparticles, tetratetracontane (TTC), and pentacene as the floating-gate layer, tunneling layer, and active layer, respectively. The electron traps at the TTC/pentacene interface were significantly suppressed, which resulted in ambipolar operation in the present memory. As both electrons and holes were supplied in the channel and trapped in the floating gate by programming/erasing operations, respectively (i.e., one type of charge carrier was used to overwrite the other, previously trapped, type), a large memory window, extending on both sides of the initial threshold voltage, was realized.

  20. Architecture, Design and Implementation of RC64, a Many-Core High-Performance DSP for Space Applications

    NASA Astrophysics Data System (ADS)

    Ginosar, Ran; Aviely, Peleg; Liran, Tuvia; Alon, Dov; Dobkin, Reuven; Goldberg, Michael

    2013-08-01

    RC64, a novel 64-core many-core signal processing chip, targets DSP performance of 12.8 GIPS, 100 GOPS, and 12.8 single-precision GFLOPS while dissipating only 3 Watts. RC64 employs advanced DSP cores, a multi-bank shared memory, and a hardware scheduler, supports DDR2 memory, and communicates over five proprietary 6.4 Gbps channels. The programming model employs sequential fine-grain tasks and a separate task map to define task dependencies. RC64 is implemented as a 200 MHz ASIC on Tower 130nm CMOS technology, assembled in a hermetically sealed ceramic QFP package, and qualified to the highest space standards.
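
    A toy software rendering of the stated programming model (the task names and task map are invented; on RC64 the scheduler is hardware): sequential fine-grain tasks plus a separate task map naming each task's predecessors, executed by a scheduler that dispatches a task once all of its dependencies are done:

      from collections import deque

      tasks = {
          "load_a": lambda ctx: ctx.update(a=[1, 2, 3]),
          "load_b": lambda ctx: ctx.update(b=[4, 5, 6]),
          "fir":    lambda ctx: ctx.update(y=[2 * x for x in ctx["a"]]),
          "mix":    lambda ctx: ctx.update(out=[p + q for p, q in zip(ctx["y"], ctx["b"])]),
      }
      task_map = {"load_a": [], "load_b": [], "fir": ["load_a"], "mix": ["fir", "load_b"]}

      def run(tasks, task_map):
          ready = deque(t for t, deps in task_map.items() if not deps)
          ctx, done = {}, set()
          while ready:                  # hardware would dispatch ready tasks to cores
              t = ready.popleft()
              tasks[t](ctx)
              done.add(t)
              for u, deps in task_map.items():
                  if u not in done and u not in ready and all(d in done for d in deps):
                      ready.append(u)
          return ctx

      print(run(tasks, task_map)["out"])   # [6, 9, 12]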

  1. SharP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Venkata, Manjunath Gorentla; Aderholdt, William F

    Pre-exascale systems are expected to have a significant amount of hierarchical and heterogeneous on-node memory, and this trend in system architecture is expected to continue into the exascale era. Along with hierarchical-heterogeneous memory, such a system typically has a high-performing network and a compute accelerator. This system architecture is effective not only for running traditional High Performance Computing (HPC) applications (Big-Compute), but also for running data-intensive HPC applications and Big-Data applications. As a consequence, there is a growing desire to have a single system serve the needs of both Big-Compute and Big-Data applications. Though the system architecture supports the convergence of Big-Compute and Big-Data, the programming models and software layer have yet to evolve to support either hierarchical-heterogeneous memory systems or the convergence. SharP is a programming abstraction that addresses this problem. It is implemented as a software library and runs on pre-exascale and exascale systems supporting current and emerging system architectures. Using distributed data-structures as a central concept, it provides (1) a simple, usable, and portable abstraction for hierarchical-heterogeneous memory and (2) a unified programming abstraction for Big-Compute and Big-Data applications.

  2. Thread Migration in the Presence of Pointers

    NASA Technical Reports Server (NTRS)

    Cronk, David; Haines, Matthew; Mehrotra, Piyush

    1996-01-01

    Dynamic migration of lightweight threads supports both data locality and load balancing. However, migrating threads that contain pointers referencing data in both the stack and heap remains an open problem. In this paper we describe a technique by which threads with pointers referencing both stack and non-shared heap data can be migrated such that the pointers remain valid after migration. As a result, threads containing pointers can now be migrated between processors in a homogeneous distributed memory environment.
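
    An illustrative sketch of the relocation arithmetic such a migration needs (not the paper's algorithm; addresses are plain integers here): pointers into the migrated block are rebased to the new base address, while external pointers are left untouched:

      def rebase(pointers, old_base, size, new_base):
          """Rewrite every pointer that targets [old_base, old_base + size)."""
          moved = range(old_base, old_base + size)
          return [p - old_base + new_base if p in moved else p for p in pointers]

      # A thread's stack holds two pointers into its own (migrated) block
      # and one pointer to shared data that did not move.
      stack = [0x1008, 0x10F0, 0x4000]
      print([hex(p) for p in rebase(stack, 0x1000, 0x100, 0x8000)])
      # ['0x8008', '0x80f0', '0x4000']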

  3. Concurrent Collections (CnC): A new approach to parallel programming

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Knobe, Kathleen

    2010-05-07

    A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer or program) who demands maximal flexibility for optimizing the mapping to a wide range of target platforms (parallel / serial, shared / distributed, homogeneous / heterogeneous, etc.) but needs no knowledge of the domain. Concurrent Collections (CnC) is based on this separation of concerns. The talk will present CnC and its benefits. About the speaker. Kathleen Knobe has focused throughout her career on parallelism especially compiler technology, runtime system design and language design. She worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms for Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipment’s Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired and absorbed by Intel. She currently works in the Software and Services Group / Technology Pathfinding and Innovation.
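
    A toy rendering of this separation of concerns (all names invented; real CnC has item, step, and tag collections): the domain expert writes steps that only get and put items, while a stand-in runtime decides when each step may fire:

      class Graph:
          def __init__(self):
              self.items = {}                  # item collection: tag -> value
              self.steps = []                  # (input tags, output tag, fn)
          def put(self, tag, value):
              self.items[tag] = value
          def step(self, inputs, output, fn):
              self.steps.append((inputs, output, fn))
          def run(self):
              # Runtime side: repeatedly fire any step whose inputs are available.
              pending = list(self.steps)
              while pending:
                  fired = False
                  for s in pending:
                      ins, out, fn = s
                      if all(t in self.items for t in ins):
                          self.items[out] = fn(*(self.items[t] for t in ins))
                          pending.remove(s)
                          fired = True
                          break
                  if not fired:
                      raise RuntimeError("deadlock: unsatisfiable step inputs")

      g = Graph()
      g.step(["x", "y"], "sum", lambda a, b: a + b)   # domain expert's code:
      g.step(["sum"], "doubled", lambda s: 2 * s)     # no platform knowledge
      g.put("x", 3); g.put("y", 4)
      g.run()
      print(g.items["doubled"])                       # 14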

  4. Concurrent Collections (CnC): A new approach to parallel programming

    ScienceCinema

    Knobe, Kathleen

    2018-04-16

    A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer or program) who demands maximal flexibility for optimizing the mapping to a wide range of target platforms (parallel / serial, shared / distributed, homogeneous / heterogeneous, etc.) but needs no knowledge of the domain. Concurrent Collections (CnC) is based on this separation of concerns. The talk will present CnC and its benefits. About the speaker. Kathleen Knobe has focused throughout her career on parallelism especially compiler technology, runtime system design and language design. She worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms for Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipment’s Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired and absorbed by Intel. She currently works in the Software and Services Group / Technology Pathfinding and Innovation.

  5. PCI bus content-addressable-memory (CAM) implementation on FPGA for pattern recognition/image retrieval in a distributed environment

    NASA Astrophysics Data System (ADS)

    Megherbi, Dalila B.; Yan, Yin; Tanmay, Parikh; Khoury, Jed; Woods, C. L.

    2004-11-01

    Recently, surveillance and Automatic Target Recognition (ATR) applications have been increasing as the cost of the computing power needed to process the massive amount of information continues to fall. This computing power has been made possible partly by the latest advances in FPGAs and SOPCs. In particular, to design and implement state-of-the-art electro-optical imaging systems that provide advanced surveillance capabilities, there is a need to integrate several technologies (e.g., telescopes, precise optics, cameras, and image/computer vision algorithms, which can be geographically distributed or share distributed resources) into programmable and DSP systems. Additionally, pattern recognition techniques and fast information retrieval are often important components of intelligent systems. The aim of this work is to use an embedded FPGA as a fast, configurable, and synthesizable search engine for fast image pattern recognition/retrieval in a distributed hardware/software co-design environment. In particular, we propose and demonstrate a low-cost Content Addressable Memory (CAM)-based distributed embedded FPGA hardware architecture with real-time recognition capabilities for pattern look-up, pattern recognition, and image retrieval. We show how the distributed CAM-based architecture offers an order-of-magnitude performance advantage over RAM (Random Access Memory)-based search for implementing high-speed pattern recognition for image retrieval. The methods of designing, implementing, and analyzing the proposed CAM-based embedded architecture are described, and other SOPC solutions and design issues are covered. Finally, experimental results, hardware verification, and performance evaluations using both the Xilinx Virtex-II and the Altera Apex20k are provided to show the potential and power of the proposed method for low-cost reconfigurable fast image pattern recognition/retrieval at the hardware/software co-design level.
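
    A minimal sketch of why associative search wins (a Python dict stands in for the CAM; the timings are only suggestive of the order-of-magnitude gap the paper reports in hardware):

      import time

      patterns = [f"pattern-{i:06d}" for i in range(200_000)]
      ram = list(patterns)                                 # address -> content
      cam = {p: addr for addr, p in enumerate(patterns)}   # content -> address
      probe = "pattern-199999"

      t0 = time.perf_counter()
      addr_ram = ram.index(probe)       # RAM: scan word by word until a match
      t1 = time.perf_counter()
      addr_cam = cam[probe]             # CAM: match by content in one step
      t2 = time.perf_counter()
      assert addr_ram == addr_cam
      print(f"RAM scan {t1 - t0:.6f}s   CAM lookup {t2 - t1:.6f}s")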

  6. Tissue-specific programming of memory CD8 T cell subsets impacts protection against lethal respiratory virus infection

    PubMed Central

    Tahiliani, Vikas

    2016-01-01

    How tissue-specific anatomical distribution and phenotypic specialization are linked to protective efficacy of memory T cells against reinfection is unclear. Here, we show that lung environmental cues program recently recruited central-like memory cells with migratory potentials for their tissue-specific functions during lethal respiratory virus infection. After entering the lung, some central-like cells retain their original CD27hiCXCR3hi phenotype, enabling them to localize near the infected bronchiolar epithelium and airway lumen to function as the first line of defense against pathogen encounter. Others, in response to local cytokine triggers, undergo a secondary program of differentiation that leads to the loss of CXCR3, migration arrest, and clustering within peribronchoarterial areas and in interalveolar septa. Here, the immune system adapts its response to prevent systemic viral dissemination and mortality. These results reveal the striking and unexpected spatial organization of central- versus effector-like memory cells within the lung and how cooperation between these two subsets contributes to host defense. PMID:27879287

  7. Using CLIPS in the domain of knowledge-based massively parallel programming

    NASA Technical Reports Server (NTRS)

    Dvorak, Jiri J.

    1994-01-01

    The Program Development Environment (PDE) is a tool for massively parallel programming of distributed-memory architectures. Adopting a knowledge-based approach, the PDE eliminates the complexity introduced by parallel hardware with distributed memory and offers complete transparency with respect to parallelism exploitation. The knowledge-based part of the PDE is realized in CLIPS. Its principal task is to find an efficient parallel realization of the application specified by the user in a comfortable, abstract, domain-oriented formalism. A large collection of fine-grain parallel algorithmic skeletons, represented as COOL objects in a tree hierarchy, contains the algorithmic knowledge. A hybrid knowledge base with rule modules and procedural parts, encoding expertise about the application domain, parallel programming, software engineering, and parallel hardware, enables a high degree of automation in the software development process. In this paper, important aspects of the implementation of the PDE using CLIPS and COOL are shown, including the embedding of CLIPS with the C++-based parts of the PDE. The appropriateness of the chosen approach and of the CLIPS language for knowledge-based software engineering is discussed.

  8. CD4 memory T cells develop and acquire functional competence by sequential cognate interactions and stepwise gene regulation

    PubMed Central

    Kaji, Tomohiro; Hijikata, Atsushi; Ishige, Akiko; Kitami, Toshimori; Watanabe, Takashi; Ohara, Osamu; Yanaka, Noriyuki; Okada, Mariko; Shimoda, Michiko; Taniguchi, Masaru

    2016-01-01

    Memory CD4+ T cells promote protective humoral immunity; however, how memory T cells acquire this activity remains unclear. This study demonstrates that CD4+ T cells develop into antigen-specific memory T cells that can promote the terminal differentiation of memory B cells far more effectively than their naive T-cell counterparts. Memory T cell development requires the transcription factor B-cell lymphoma 6 (Bcl6), which is known to direct T-follicular helper (Tfh) cell differentiation. However, unlike Tfh cells, memory T cell development did not require germinal center B cells. Curiously, memory T cells that develop in the absence of cognate B cells cannot promote memory B-cell recall responses and this defect was accompanied by down-regulation of genes associated with homeostasis and activation and up-regulation of genes inhibitory for T-cell responses. Although memory T cells display phenotypic and genetic signatures distinct from Tfh cells, both had in common the expression of a group of genes associated with metabolic pathways. This gene expression profile was not shared to any great extent with naive T cells and was not influenced by the absence of cognate B cells during memory T cell development. These results suggest that memory T cell development is programmed by stepwise expression of gatekeeper genes through serial interactions with different types of antigen-presenting cells, first licensing the memory lineage pathway and subsequently facilitating the functional development of memory T cells. Finally, we identified Gdpd3 as a candidate genetic marker for memory T cells. PMID:26714588

  9. HYDRA : High-speed simulation architecture for precision spacecraft formation simulation

    NASA Technical Reports Server (NTRS)

    Martin, Bryan J.; Sohl, Garett.

    2003-01-01

    The Hierarchical Distributed Reconfigurable Architecture (HYDRA) is a scalable simulation architecture that provides flexibility and ease of use while taking advantage of modern computation and communication hardware. It also provides the ability to implement distributed (workstation-based) simulations and high-fidelity real-time simulation from a common core. Originally designed to serve as a research platform for examining fundamental challenges in formation-flying simulation for future space missions, it is also finding use in other missions and applications, all of which can take advantage of the underlying object-oriented structure to easily produce distributed simulations. Hydra automates the process of connecting disparate simulation components (Hydra Clients) through a client-server architecture that uses high-level descriptions of the data associated with each client to find and forge desirable connections (Hydra Services) at run time. Services communicate through the use of Connectors, which abstract messaging to provide single-interface access to any desired communication protocol, from shared-memory message passing to TCP/IP to ACE and CORBA. Hydra shares many features with the HLA, while providing more flexibility in connectivity services and behavior overriding.

  10. Interference due to shared features between action plans is influenced by working memory span.

    PubMed

    Fournier, Lisa R; Behmer, Lawrence P; Stubblefield, Alexandra M

    2014-12-01

    In this study, we examined the interactions between the action plans that we hold in memory and the actions that we carry out, asking whether the interference due to shared features between action plans is due to selection demands imposed on working memory. Individuals with low and high working memory spans learned arbitrary motor actions in response to two different visual events (A and B), presented in a serial order. They planned a response to the first event (A) and while maintaining this action plan in memory they then executed a speeded response to the second event (B). Afterward, they executed the action plan for the first event (A) maintained in memory. Speeded responses to the second event (B) were delayed when it shared an action feature (feature overlap) with the first event (A), relative to when it did not (no feature overlap). The size of the feature-overlap delay was greater for low-span than for high-span participants. This indicates that interference due to overlapping action plans is greater when fewer working memory resources are available, suggesting that this interference is due to selection demands imposed on working memory. Thus, working memory plays an important role in managing current and upcoming action plans, at least for newly learned tasks. Also, managing multiple action plans is compromised in individuals who have low versus high working memory spans.

  11. Destination Memory Impairment in Older People

    PubMed Central

    Gopie, Nigel; Craik, Fergus I. M.; Hasher, Lynn

    2012-01-01

    Older adults are assumed to have poor destination memory— knowing to whom they tell particular information—and anecdotes about them repeating stories to the same people are cited as informal evidence for this claim. Experiment 1 assessed young and older adults’ destination memory by having participants tell facts (e.g., “A dime has 118 ridges around its edge”) to pictures of famous people (e.g., Oprah Winfrey). Surprise recognition memory tests, which also assessed confidence, revealed that older adults, compared to young adults, were disproportionately impaired on destination memory relative to spared memory for the individual components (i.e., facts, faces) of the episode. Older adults also were more confident that they had not told a fact to a particular person when they actually had (i.e., a miss); this presumably causes them to repeat information more often than young adults. When the direction of information transfer was reversed in Experiment 2, such that the famous people shared information with the participants (i.e., a source memory experiment), age-related memory differences disappeared. In contrast to the destination memory experiment, older adults in the source memory experiment were more confident than young adults that someone had shared a fact with them when a different person actually had shared the fact (i.e., a false alarm). Overall, accuracy and confidence jointly influence age-related changes to destination memory, a fundamental component of successful communication. PMID:20718537

  12. Performance of the Heavy Flavor Tracker (HFT) detector in star experiment at RHIC

    NASA Astrophysics Data System (ADS)

    Alruwaili, Manal

    With the growing technology, the number of processors is becoming massive. Current supercomputer processing will be available on desktops in the next decade. For mass-scale application software development on the massively parallel computing available on desktops, existing popular languages with large libraries have to be augmented with new constructs and paradigms that exploit massively parallel computing and distributed memory models while retaining user-friendliness. Currently available object-oriented languages for massively parallel computing, such as Chapel, X10, and UPC++, exploit distributed computing, data-parallel computing, and thread-level parallelism at the process level in the PGAS (Partitioned Global Address Space) memory model. However, they do not incorporate: 1) extensions for object distribution to exploit the PGAS model; 2) the flexibility of migrating or cloning an object between places to exploit load balancing; or 3) the programming paradigms that result from integrating data- and thread-level parallelism with object distribution. In the proposed thesis, I compare different languages in the PGAS model; propose new constructs that extend C++ with object distribution, object migration, and object cloning; and integrate PGAS-based process constructs with these extensions on distributed objects. A new paradigm, MIDD (Multiple Invocation Distributed Data), is also presented, in which different copies of the same class can be invoked and work on different elements of distributed data concurrently using remote method invocations. I present the new constructs, their grammar, and their behavior. The new constructs are explained using simple programs that utilize them.
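
    A hedged sketch of the MIDD idea (the class, method, and process-pool stand-in for PGAS places are all invented): several copies of one class each own a partition of distributed data, and the same method is invoked on all copies concurrently:

      import multiprocessing as mp

      class Partition:
          def __init__(self, data):
              self.data = data                # this copy's share of the array
          def scale(self, k):                 # method invoked on each copy
              return [k * x for x in self.data]

      def invoke(args):
          part, method, arg = args
          return getattr(part, method)(arg)   # remote-method-invocation stand-in

      if __name__ == "__main__":
          data = list(range(12))
          parts = [Partition(data[i::4]) for i in range(4)]   # 4 places, cyclic
          with mp.Pool(4) as pool:
              results = pool.map(invoke, [(p, "scale", 10) for p in parts])
          print(results)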

  13. Post-Secondary Analysis of Clothing/Textiles Technology Programs in Texas.

    ERIC Educational Resources Information Center

    Glosson, Linda R.; And Others

    A study examined postsecondary occupational programs in clothing and textiles technology in Texas in order to (1) identify common essential competencies taught in postsecondary clothing/textiles technology programs, (2) develop and distribute student competency profiles of essential common competencies shared by the eight areas of study within…

  14. NPS Collaborative Technology Testbed for ONR CKM Program

    DTIC Science & Technology

    2005-01-11

      or have access to the MIT E-Wall hosted by the TOC. The combination of E-Wall and agents lends itself to the dynamic gathering and display of...display, intuitive icons or menus that are easy to activate and customize, and automatically seeks and connects to other like services/networks/agents...integration creates a network-centric memory mechanism for developing shared understanding of SA events Data Base Integration of Sensor-DM Agents and

  15. Verified Separate Compilation for C

    DTIC Science & Technology

    2015-06-01

      simulations, says that the visible set is closed under reachability. These two conditions, plus (6.2) and monotonicity of the REACH relation, imply...erase to a CompCert memory m. By erasure, we mean the removal of the “juice” that is unnecessary for execution (as in Curry-style type erasure of...simply typed lambda calculus). The “juice” has several components: permission shares controlling access to objects in the program logic; predicates in the

  16. Parallel programming with Easy Java Simulations

    NASA Astrophysics Data System (ADS)

    Esquembre, F.; Christian, W.; Belloni, M.

    2018-01-01

    Nearly all of today's processors are multicore, and ideally programming and algorithm development utilizing the entire processor should be introduced early in the computational physics curriculum. Parallel programming is often not introduced because it requires a new programming environment and uses constructs that are unfamiliar to many teachers. We describe how we decrease the barrier to parallel programming by using a Java-based programming environment to treat problems in the usual undergraduate curriculum. We use the Easy Java Simulations programming and authoring tool to create the program's graphical user interface, together with objects based on those developed by Kaminsky [Building Parallel Programs (Course Technology, Boston, 2010)] to handle common parallel programming tasks. Shared-memory parallel implementations of physics problems, such as time evolution of the Schrödinger equation, are available as source code and as ready-to-run programs from the AAPT-ComPADRE digital library.

  17. PEPFAR/DOD/Pharmaccess/Tanzania Peoples Defence Forces HIV/AIDS Program

    DTIC Science & Technology

    2007-10-01

      institutions share many features of a private company, including a hierarchy of functions, investment in training and responsibility for the health status...Decks of cards will be distributed to all TPDF Units, Intelligence, Navy and Air Force bases and schools. 2000 Decks will be shared (under the Global

  18. Marionette

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sullivan, M.; Anderson, D.P.

    1988-01-01

    Marionette is a system for distributed parallel programming in an environment of networked heterogeneous computer systems. It is based on a master/slave model. The master process can invoke worker operations (asynchronous remote procedure calls to single slaves) and context operations (updates to the state of all slaves). The master and slaves also interact through shared data structures that can be modified only by the master. The master and slave processes are programmed in a sequential language. The Marionette runtime system manages slave process creation, propagates shared data structures to slaves as needed, queues and dispatches worker and context operations, and manages recovery from slave processor failures. The Marionette system also includes tools for automated compilation of program binaries for multiple architectures, and for distributing binaries to remote file systems. A UNIX-based implementation of Marionette is described.
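
    A minimal stdlib sketch of the master/slave model described above (message format and names invented): the master issues worker operations to single slaves and context operations that update every slave's replica of the shared state:

      import multiprocessing as mp

      def slave(conn):
          context = {}                              # replica of the shared state
          while True:
              kind, payload = conn.recv()
              if kind == "context":                 # context op: update state
                  context.update(payload)
              elif kind == "work":                  # worker op: compute, reply
                  conn.send(payload * context.get("factor", 1))
              else:                                 # shutdown
                  conn.close()
                  return

      if __name__ == "__main__":
          conns, procs = [], []
          for _ in range(3):
              master_end, slave_end = mp.Pipe()
              p = mp.Process(target=slave, args=(slave_end,))
              p.start()
              conns.append(master_end)
              procs.append(p)

          for c in conns:                           # context op -> all slaves
              c.send(("context", {"factor": 10}))
          for i, c in enumerate(conns):             # worker ops -> single slaves
              c.send(("work", i + 1))
          print([c.recv() for c in conns])          # [10, 20, 30]
          for c, p in zip(conns, procs):
              c.send(("stop", None))
              p.join()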

  19. Automatic generation of efficient array redistribution routines for distributed memory multicomputers

    NASA Technical Reports Server (NTRS)

    Ramaswamy, Shankar; Banerjee, Prithviraj

    1994-01-01

    Appropriate data distribution has been found to be critical for obtaining good performance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. One of the significant contributions of this work is being able to handle arbitrary source and target processor sets while performing redistribution. Another important contribution is the ability to handle an arbitrary number of dimensions for the array involved in the redistribution in a scalable manner. Our implementation of these techniques is based on an MPI-like communication library. The results presented show the low overheads for our redistribution algorithm as compared to naive runtime methods.
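
    An illustrative sketch of the index-set computation behind such redistribution (PITFALLS itself is a more general closed-form representation; this version simply enumerates a small 1-D array moving from a BLOCK layout on P source processors to a CYCLIC layout on Q targets):

      def block_owner(i, n, p):                 # BLOCK: contiguous slabs
          return i // -(-n // p)                # blocks of size ceil(n/p)

      def cyclic_owner(i, q):                   # CYCLIC: round robin
          return i % q

      def redistribution_sets(n, p, q):
          """sets[(src, dst)] -> global indices src must send to dst.
          A real implementation derives these sets in closed form rather
          than enumerating every index."""
          sets = {}
          for i in range(n):
              key = (block_owner(i, n, p), cyclic_owner(i, q))
              sets.setdefault(key, []).append(i)
          return sets

      for (src, dst), idx in sorted(redistribution_sets(10, 2, 3).items()):
          print(f"P{src} -> Q{dst}: {idx}")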

  20. Reduced distribution of threshold voltage shift in double layer NiSi2 nanocrystals for nano-floating gate memory applications.

    PubMed

    Choi, Sungjin; Lee, Junhyuk; Kim, Donghyoun; Oh, Seulki; Song, Wangyu; Choi, Seonjun; Choi, Eunsuk; Lee, Seung-Beck

    2011-12-01

    We report on the fabrication and capacitance-voltage characteristics of double-layer nickel-silicide nanocrystals with a Si3N4 interlayer tunnel barrier for nano-floating-gate memory applications. Compared with devices using a SiO2 interlayer, the use of Si3N4 interlayer separation reduced the average size (4 nm) and size distribution (+/- 2.5 nm) of the NiSi2 nanocrystal (NC) charge traps by more than 50% and gave a twofold increase in NC density, to 2.3 x 10^12 cm^-2. The increased density and reduced NC size distribution resulted in a significant decrease in the spread of the device C-V characteristics. For each program voltage, the distribution of the shift in the threshold voltage was reduced by more than 50% on average, to less than 0.7 V, demonstrating possible multi-level-cell operation.

  1. Low latency memory access and synchronization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by a store, such that the processor performs only a read operation and the hardware locking device, rather than the processor, performs the subsequent write operation. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory; these pointers, rather than some other predictive algorithm, determine which memory line to prefetch. This enables hardware to effectively prefetch memory access patterns that are non-contiguous but repetitive.
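
    A toy model of the disclosed pointer-directed prefetch (the memory contents and traversal are invented): every line carries, besides its data, a pointer naming the next line to fetch, so a non-contiguous but repetitive traversal can be prefetched exactly rather than predicted:

      memory = {                     # line address -> (data, embedded pointer)
          0x00: ("head", 0x40),
          0x40: ("mid",  0x10),
          0x10: ("tail", None),
      }

      def prefetch(addr):            # a real machine would start the fetch here
          if addr is not None:
              _ = memory[addr]       # touch the line so it is "in cache"

      def traverse(start):
          addr, out = start, []
          while addr is not None:
              data, nxt = memory[addr]
              prefetch(nxt)          # follow the stored pointer, not a predictor
              out.append(data)
              addr = nxt
          return out

      print(traverse(0x00))          # ['head', 'mid', 'tail']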

  2. Low latency memory access and synchronization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by a store, such that the processor performs only a read operation and the hardware locking device, rather than the processor, performs the subsequent write operation. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory; these pointers, rather than some other predictive algorithm, determine which memory line to prefetch. This enables hardware to effectively prefetch memory access patterns that are non-contiguous but repetitive.

  3. Shared reality in intergroup communication: Increasing the epistemic authority of an out-group audience.

    PubMed

    Echterhoff, Gerald; Kopietz, René; Higgins, E Tory

    2017-06-01

    Communicators typically tune messages to their audience's attitude. Such audience tuning biases communicators' memory for the topic toward the audience's attitude to the extent that they create a shared reality with the audience. To investigate shared reality in intergroup communication, we first established that a reduced memory bias after tuning messages to an out-group (vs. in-group) audience is a subtle index of communicators' denial of shared reality to that out-group audience (Experiments 1a and 1b). We then examined whether the audience-tuning memory bias might emerge when the out-group audience's epistemic authority is enhanced, either by increasing epistemic expertise concerning the communication topic or by creating epistemic consensus among members of a multiperson out-group audience. In Experiment 2, when Germans communicated to a Turkish audience with an attitude about a Turkish (vs. German) target, the audience-tuning memory bias appeared. In Experiment 3, when the audience of German communicators consisted of 3 Turks who all held the same attitude toward the target, the memory bias again appeared. The association between message valence and memory valence was consistently higher when the audience's epistemic authority was high (vs. low). An integrative analysis across all studies also suggested that the memory bias increases with increasing strength of epistemic inputs (epistemic expertise, epistemic consensus, and audience-tuned message production). The findings suggest novel ways of overcoming intergroup biases in intergroup relations. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  4. The Developmental Influence of Primary Memory Capacity on Working Memory and Academic Achievement

    PubMed Central

    Hall, Debbora; Jarrold, Christopher; Towse, John N.; Zarandi, Amy L.

    2015-01-01

    In this study, we investigate the development of primary memory capacity among children. Children between the ages of 5 and 8 completed 3 novel tasks (split span, interleaved lists, and a modified free-recall task) that measured primary memory by estimating the number of items in the focus of attention that could be spontaneously recalled in serial order. These tasks were calibrated against traditional measures of simple and complex span. Clear age-related changes in these primary memory estimates were observed. There were marked individual differences in primary memory capacity, but each novel measure was predictive of simple span performance. Among older children, each measure shared variance with reading and mathematics performance, whereas for younger children, the interleaved lists task was the strongest single predictor of academic ability. We argue that these novel tasks have considerable potential for the measurement of primary memory capacity and provide new, complementary ways of measuring the transient memory processes that predict academic performance. The interleaved lists task also shared features with interference control tasks, and our findings suggest that young children have a particular difficulty in resisting distraction and that variance in the ability to resist distraction is also shared with measures of educational attainment. PMID:26075630

  5. The developmental influence of primary memory capacity on working memory and academic achievement.

    PubMed

    Hall, Debbora; Jarrold, Christopher; Towse, John N; Zarandi, Amy L

    2015-08-01

    In this study, we investigate the development of primary memory capacity among children. Children between the ages of 5 and 8 completed 3 novel tasks (split span, interleaved lists, and a modified free-recall task) that measured primary memory by estimating the number of items in the focus of attention that could be spontaneously recalled in serial order. These tasks were calibrated against traditional measures of simple and complex span. Clear age-related changes in these primary memory estimates were observed. There were marked individual differences in primary memory capacity, but each novel measure was predictive of simple span performance. Among older children, each measure shared variance with reading and mathematics performance, whereas for younger children, the interleaved lists task was the strongest single predictor of academic ability. We argue that these novel tasks have considerable potential for the measurement of primary memory capacity and provide new, complementary ways of measuring the transient memory processes that predict academic performance. The interleaved lists task also shared features with interference control tasks, and our findings suggest that young children have a particular difficulty in resisting distraction and that variance in the ability to resist distraction is also shared with measures of educational attainment. (c) 2015 APA, all rights reserved.

  6. The effects of nongenetic memory on population level sensitivity to stress

    NASA Astrophysics Data System (ADS)

    Adams, Rhys; Nevozhay, Dmitry; van Itallie, Elizabeth; Bennett, Matthew; Balazsi, Gabor

    2011-03-01

    While gene expression is often thought of as a unidirectional determinant of cellular fitness, recent studies have shown how growth retardation due to protein expression can affect gene expression levels in single cells. We developed two yeast strains carrying a drug resistance protein under the control of different synthetic gene constructs, one of which was monostable, while the other was bistable. The gene expression of these cell populations was tuned using a molecular inducer so that their respective means and noises were identical, while their nongenetic memory properties were different. We tested the sensitivity of these two cell population distributions to the antibiotic zeocin. We found that the gene expression distributions of bistable cell populations were sensitive to stressful environments, while the gene expression distribution of monostable cells were nearly unchanged by stress. We conclude that cell populations with high nongenetic memory are more adaptable to their environment. This work was funded by the National Institutes of Health through the NIH Director's New Innovator Award Program, 1-DP2- OD006481-01.

  7. Measuring Transactiving Memory Systems Using Network Analysis

    ERIC Educational Resources Information Center

    King, Kylie Goodell

    2017-01-01

    Transactive memory systems (TMSs) describe the structures and processes that teams use to share information, work together, and accomplish shared goals. First introduced over three decades ago, TMSs have been measured in a variety of ways. This dissertation proposes the use of network analysis in measuring TMS. This is accomplished by describing…

  8. Operator Influence of Unexploded Ordnance Sensor Technologies

    DTIC Science & Technology

    2007-03-01

      chart display ActiveX control; Mscomct2.dll – date/time display ActiveX control; Pnpscr.dll – Systran SCRAMNet replicated shared memory device...response value database; rgm_p2.dll – Phase 2 shared memory API and implementation; Commercial components: StripM.ocx – strip chart display ActiveX

  9. Runtime support for parallelizing data mining algorithms

    NASA Astrophysics Data System (ADS)

    Jin, Ruoming; Agrawal, Gagan

    2002-03-01

    With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm.
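
    A hedged sketch of a reduction-object style interface (all names invented): the mining kernel is expressed as local updates to a reduction object plus a merge, which is what lets a runtime choose among full replication, the locking variants, and so on; full replication, with one private copy per thread, is shown:

      import threading

      class HistogramReduction:
          """Reduction object for a toy frequency-counting kernel."""
          def __init__(self, bins):
              self.counts = [0] * bins
          def local_update(self, value):        # called on each data item
              self.counts[value % len(self.counts)] += 1
          def merge(self, other):               # combines two private copies
              self.counts = [a + b for a, b in zip(self.counts, other.counts)]

      def worker(chunk, robj):
          for v in chunk:
              robj.local_update(v)

      if __name__ == "__main__":
          data = list(range(100_000))
          nthreads, bins = 4, 8
          copies = [HistogramReduction(bins) for _ in range(nthreads)]
          step = len(data) // nthreads
          threads = [threading.Thread(target=worker,
                                      args=(data[i*step:(i+1)*step], copies[i]))
                     for i in range(nthreads)]
          for t in threads: t.start()
          for t in threads: t.join()
          result = copies[0]
          for c in copies[1:]:                  # merge once, at the end
              result.merge(c)
          print(result.counts)                  # 12500 in each bin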

  10. Parallelization of a Monte Carlo particle transport simulation code

    NASA Astrophysics Data System (ADS)

    Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

    2010-05-01

    We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for the study of higher particle energies with the use of more accurate physical models, and improve statistics, as more particle tracks can be simulated in a short response time.
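
    A minimal sketch of the per-worker random-stream issue such a port must solve (the pi-estimation workload is only a placeholder for particle transport; NumPy's SeedSequence.spawn supplies independent, reproducible streams):

      import multiprocessing as mp
      import numpy as np

      def mc_worker(args):
          seed_seq, n = args
          rng = np.random.default_rng(seed_seq)   # independent stream per worker
          pts = rng.random((n, 2))
          return int(np.sum(pts[:, 0]**2 + pts[:, 1]**2 <= 1.0))

      if __name__ == "__main__":
          workers, n_per = 4, 250_000
          streams = np.random.SeedSequence(12345).spawn(workers)
          with mp.Pool(workers) as pool:
              hits = pool.map(mc_worker, [(s, n_per) for s in streams])
          print("pi ~", 4 * sum(hits) / (workers * n_per))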

  11. 3D Navier-Stokes Flow Analysis for Shared and Distributed Memory MIMD Computers

    DTIC Science & Technology

    1992-09-15

      arithmetically averaged density or Stefan-Boltzmann constant (= 5.67032 x 10^-8); sigma_(a,i+1/2), intermediate term for Harten-Yee fluxes; k, sigma', constants for k-...system of algebraic equations. These equations are solved using point Gauss-Seidel relaxation. This relaxation scheme is modified to be a lower-upper...interaction of the radiation with the gas. The radiative heat flux per unit area is then q_r = sigma [epsilon_w T_w^4 - alpha_w T_db^4] (19). Here sigma is the Stefan-Boltzmann

  12. Concurrent working memory load can facilitate selective attention: evidence for specialized load.

    PubMed

    Park, Soojin; Kim, Min-Shik; Chun, Marvin M

    2007-10-01

    Load theory predicts that concurrent working memory load impairs selective attention and increases distractor interference (N. Lavie, A. Hirst, J. W. de Fockert, & E. Viding). Here, the authors present new evidence that the type of concurrent working memory load determines whether load impairs selective attention or not. Working memory load was paired with a same/different matching task that required focusing on targets while ignoring distractors. When working memory items shared the same limited-capacity processing mechanisms with targets in the matching task, distractor interference increased. However, when working memory items shared processing with distractors in the matching task, distractor interference decreased, facilitating target selection. A specialized load account is proposed to describe the dissociable effects of working memory load on selective processing depending on whether the load overlaps with targets or with distractors. (c) 2007 APA

  13. Internet Technology in Magnetic Resonance: A Common Gateway Interface Program for the World-Wide Web NMR Spectrometer

    NASA Astrophysics Data System (ADS)

    Buszko, Marian L.; Buszko, Dominik; Wang, Daniel C.

    1998-04-01

    A custom-written Common Gateway Interface (CGI) program for remote control of an NMR spectrometer using a World Wide Web browser has been described. The program, running on a UNIX workstation, uses multiple processes to handle concurrent tasks of interacting with the user and with the spectrometer. The program's parent process communicates with the browser and sends out commands to the spectrometer; the child process is mainly responsible for data acquisition. Communication between the processes is via the shared memory mechanism. The WWW pages that have been developed for the system make use of the frames feature of web browsers. The CGI program provides an intuitive user interface to the NMR spectrometer, making, in effect, a complex system an easy-to-use Web appliance.
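
    A minimal, UNIX-only sketch of the described process structure (names invented; an anonymous mmap stands in for whatever shared-memory mechanism the real program uses): a forked child writes acquired data into a shared mapping that the parent reads:

      import mmap, os, struct, time

      buf = mmap.mmap(-1, 8)            # anonymous shared mapping, survives fork

      pid = os.fork()
      if pid == 0:                      # child: stands in for data acquisition
          for i in range(1, 4):
              buf.seek(0)
              buf.write(struct.pack("d", 1.5 * i))   # publish a new "data point"
              time.sleep(0.1)
          os._exit(0)
      else:                             # parent: stands in for the CGI loop
          os.waitpid(pid, 0)
          buf.seek(0)
          (value,) = struct.unpack("d", buf.read(8))
          print("last value in shared memory:", value)   # 4.5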

  14. Sharing programming resources between Bio* projects through remote procedure call and native call stack strategies.

    PubMed

    Prins, Pjotr; Goto, Naohisa; Yates, Andrew; Gautier, Laurent; Willis, Scooter; Fields, Christopher; Katayama, Toshiaki

    2012-01-01

    Open-source software (OSS) encourages computer programmers to reuse software components written by others. In evolutionary bioinformatics, OSS comes in a broad range of programming languages, including C/C++, Perl, Python, Ruby, Java, and R. To avoid writing the same functionality multiple times for different languages, it is possible to share components by bridging computer languages and Bio* projects, such as BioPerl, Biopython, BioRuby, BioJava, and R/Bioconductor. In this chapter, we compare the two principal approaches for sharing software between different programming languages: either by remote procedure call (RPC) or by sharing a local call stack. RPC provides a language-independent protocol over a network interface; examples are RSOAP and Rserve. The local call stack provides a between-language mapping not over the network interface, but directly in computer memory; examples are R bindings, RPy, and languages sharing the Java Virtual Machine stack. This functionality provides strategies for sharing software between Bio* projects that could be exploited more often. Here, we present cross-language examples for sequence translation, and measure throughput of the different options. We compare calling into R through native R, RSOAP, Rserve, and RPy interfaces, with the performance of native BioPerl, Biopython, BioJava, and BioRuby implementations, and with call stack bindings to BioJava and the European Molecular Biology Open Software Suite. In general, call stack approaches outperform native Bio* implementations and these, in turn, outperform RPC-based approaches. To test and compare strategies, we provide a downloadable BioNode image with all examples, tools, and libraries included. The BioNode image can be run on VirtualBox-supported operating systems, including Windows, OSX, and Linux.
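
    A hedged sketch of the comparison using only the standard library (toy translator, deliberately minimal codon table, arbitrary port): the same sequence-translation call made in-process, call-stack style, and over XML-RPC, with rough throughput for each:

      import threading, time
      from xmlrpc.server import SimpleXMLRPCServer
      from xmlrpc.client import ServerProxy

      CODONS = {"ATG": "M", "TGG": "W", "TAA": "*"}   # minimal table for the demo

      def translate(dna):
          return "".join(CODONS.get(dna[i:i+3], "X") for i in range(0, len(dna), 3))

      if __name__ == "__main__":
          server = SimpleXMLRPCServer(("127.0.0.1", 8001), logRequests=False)
          server.register_function(translate)
          threading.Thread(target=server.serve_forever, daemon=True).start()

          seq, n = "ATGTGGTAA" * 10, 200
          t0 = time.perf_counter()
          for _ in range(n):
              translate(seq)                          # local call stack
          t1 = time.perf_counter()
          proxy = ServerProxy("http://127.0.0.1:8001")
          for _ in range(n):
              proxy.translate(seq)                    # remote procedure call
          t2 = time.perf_counter()
          print(f"in-process: {n/(t1-t0):.0f} calls/s   RPC: {n/(t2-t1):.0f} calls/s")
          server.shutdown()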

  15. Transactive memory systems scale for couples: development and validation

    PubMed Central

    Hewitt, Lauren Y.; Roberts, Lynne D.

    2015-01-01

    People in romantic relationships can develop shared memory systems by pooling their cognitive resources, allowing each person access to more information but with less cognitive effort. Research examining such memory systems in romantic couples largely focuses on remembering word lists or performing lab-based tasks, but these types of activities do not capture the processes underlying couples’ transactive memory systems, and may not be representative of the ways in which romantic couples use their shared memory systems in everyday life. We adapted an existing measure of transactive memory systems for use with romantic couples (TMSS-C), and conducted an initial validation study. In total, 397 participants who each identified as being a member of a romantic relationship of at least 3 months duration completed the study. The data provided a good fit to the anticipated three-factor structure of the components of couples’ transactive memory systems (specialization, credibility and coordination), and there was reasonable evidence of both convergent and divergent validity, as well as strong evidence of test–retest reliability across a 2-week period. The TMSS-C provides a valuable tool that can quickly and easily capture the underlying components of romantic couples’ transactive memory systems. It has potential to help us better understand this intriguing feature of romantic relationships, and how shared memory systems might be associated with other important features of romantic relationships. PMID:25999873

  16. Distributed Processor/Memory Architectures Design Program

    DTIC Science & Technology

    1975-02-01

    [OCR fragments from the report's list of figures; recoverable titles include "Event Scheduling Flow", "Global Message Input Event Scheduling Flow", "Data Representation", "GEX/LEX Scheduling Philosophy", "Executive Control Hierarchy", "Scheduler Subroutine Interrelationships", and "Task Scheduler Message Scatter".]

  17. Parallel algorithms for modeling flow in permeable media. Annual report, February 15, 1995 - February 14, 1996

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    G.A. Pope; K. Sephernoori; D.C. McKinney

    1996-03-15

    This report describes the application of distributed-memory parallel programming techniques to a compositional simulator called UTCHEM. The University of Texas Chemical Flooding reservoir simulator (UTCHEM) is a general-purpose vectorized chemical flooding simulator that models the transport of chemical species in three-dimensional, multiphase flow through permeable media. The parallel version of UTCHEM addresses solving large-scale problems by reducing the amount of time that is required to obtain the solution as well as providing a flexible and portable programming environment. In this work, the original parallel version of UTCHEM was modified and ported to CRAY T3D and CRAY T3E, distributed-memory, multiprocessor computers using CRAY-PVM as the interprocessor communication library. Also, the data communication routines were modified such that the portability of the original code across different computer architectures was made possible.
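
    The interprocessor communication in such a port is typically a halo (ghost-cell) exchange between neighboring subdomains. The sketch below illustrates that pattern in C using MPI rather than the CRAY-PVM library the report actually used; the 1-D decomposition and array sizes are invented for illustration.

    ```c
    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 100                 /* interior cells owned by each rank */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double u[NLOCAL + 2];          /* +2 ghost cells for neighbor values */
        for (int i = 1; i <= NLOCAL; i++)
            u[i] = rank;               /* dummy field data */

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Exchange boundary cells with both neighbors; MPI_PROC_NULL makes
         * the domain edges no-ops. */
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                     &u[0],          1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d ghosts: %g %g\n", rank, u[0], u[NLOCAL + 1]);
        MPI_Finalize();
        return 0;
    }
    ```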

  18. Runtime Systems for Extreme Scale Platforms

    DTIC Science & Technology

    2013-12-01

    [OCR fragments from the thesis bibliography; recoverable entries cite Y. Guo, R. Barik, R. Raman, and V. Sarkar on work-first and help-first scheduling (PPPJ '11, pp. 51-61, ACM, 2011), and R. Barik, J. Zhao, D. Grove, I. Peshansky, Z. Budimlić, and V. Sarkar, "Communication Optimizations for Distributed-Memory X10 Programs".]

  19. Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Levine, Benjamin G., E-mail: ben.levine@temple.ed; Stone, John E., E-mail: johns@ks.uiuc.ed; Kohlmeyer, Axel, E-mail: akohlmey@temple.ed

    2011-05-01

    The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.
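
    The algorithmic core is a histogram built concurrently with atomic increments. As a rough CPU analogue of the paper's GPU kernels, the C sketch below bins pair distances from several threads using C11 atomics; it does not model GPU on-chip shared memory or the tiling scheme, and all sizes are toy values.

    ```c
    #include <math.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NBINS 64
    #define NATOMS 1000
    #define NTHREADS 4
    #define RMAX 10.0

    static double pos[NATOMS][3];
    static atomic_long hist[NBINS];

    static void *worker(void *arg)
    {
        long t = (long)arg;
        /* Each thread takes a strided slice of the i-loop over atom pairs. */
        for (int i = (int)t; i < NATOMS; i += NTHREADS)
            for (int j = i + 1; j < NATOMS; j++) {
                double dx = pos[i][0] - pos[j][0];
                double dy = pos[i][1] - pos[j][1];
                double dz = pos[i][2] - pos[j][2];
                double r = sqrt(dx*dx + dy*dy + dz*dz);
                int bin = (int)(r / RMAX * NBINS);
                if (bin >= 0 && bin < NBINS)
                    atomic_fetch_add(&hist[bin], 1);  /* contended, but correct */
            }
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < NATOMS; i++)              /* dummy coordinates */
            for (int k = 0; k < 3; k++)
                pos[i][k] = (double)((i * 31 + k * 7) % 100) / 10.0;

        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        printf("bin 0 count: %ld\n", (long)hist[0]);
        return 0;
    }
    ```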

  20. Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units—Radial Distribution Function Histogramming

    PubMed Central

    Stone, John E.; Kohlmeyer, Axel

    2011-01-01

    The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU’s memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 seconds per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis. PMID:21547007

  1. YAPPA: a Compiler-Based Parallelization Framework for Irregular Applications on MPSoCs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lovergine, Silvia; Tumeo, Antonino; Villa, Oreste

    Modern embedded systems include hundreds of cores. Because of the difficulty in providing a fast, coherent memory architecture, these systems usually rely on non-coherent, non-uniform memory architectures with private memories for each core. However, programming these systems poses significant challenges. The developer must extract large amounts of parallelism, while orchestrating communication among cores to optimize application performance. These issues become even more significant with irregular applications, which present data sets difficult to partition, unpredictable memory accesses, unbalanced control flow and fine grained communication. Hand-optimizing every single aspect is hard and time-consuming, and it often does not lead to the expected performance. There is a growing gap between such complex and highly-parallel architectures and the high level languages used to describe the specification, which were designed for simpler systems and do not consider these new issues. In this paper we introduce YAPPA (Yet Another Parallel Programming Approach), a compilation framework for the automatic parallelization of irregular applications on modern MPSoCs based on LLVM. We start by considering an efficient parallel programming approach for irregular applications on distributed memory systems. We then propose a set of transformations that can reduce the development and optimization effort. The results of our initial prototype confirm the correctness of the proposed approach.

  2. DMA shared byte counters in a parallel computer

    DOEpatents

    Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos

    2010-04-06

    A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFO metadata maintains the memory locations of each injection FIFO, including its current head and tail, and the reception FIFO metadata maintains the memory locations of each reception FIFO, including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
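
    As a reading aid, the C structures below render the patent's vocabulary (FIFO metadata with head/tail, byte counters shareable between messages) in code form. All field names and sizes are invented; the patent defines hardware registers, not C structs.

    ```c
    #include <stdint.h>

    struct fifo_metadata {
        uint64_t base;        /* start of the FIFO region in memory */
        uint64_t size;        /* FIFO size in bytes */
        uint64_t head;        /* current head (consumer position) */
        uint64_t tail;        /* current tail (producer position) */
    };

    struct byte_counter {
        uint64_t base;        /* base address of the message buffer */
        int64_t  remaining;   /* bytes still to inject or receive; one
                                 counter may be shared by several messages */
    };

    struct dma_engine {
        struct fifo_metadata injection_fifo;
        struct fifo_metadata reception_fifo;
        struct byte_counter  injection_counters[64];
        struct byte_counter  reception_counters[64];
        uint64_t status;      /* status registers */
        uint64_t control;     /* control registers */
    };

    /* A message (or group of messages sharing a counter) completes when the
     * counter reaches zero. */
    static inline int message_done(const struct byte_counter *c)
    {
        return c->remaining <= 0;
    }
    ```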

  3. Division of attention as a function of the number of steps, visual shifts, and memory load

    NASA Technical Reports Server (NTRS)

    Chechile, R. A.; Butler, K.; Gutowski, W.; Palmer, E. A.

    1986-01-01

    The effects on divided attention of visual shifts and long-term memory retrieval during a monitoring task are considered. A concurrent vigilance task was standardized under all experimental conditions. The results show that subjects can perform nearly perfectly on all of the time-shared tasks if long-term memory retrieval is not required for monitoring. With the requirement of memory retrieval, however, there was a large decrease in accuracy for all of the time-shared activities. It was concluded that the attentional demand of long-term memory retrieval is appreciable (even for a well-learned motor sequence), and thus memory retrieval results in a sizable reduction in the capability of subjects to divide their attention. A selected bibliography on the divided attention literature is provided.

  4. Personal semantics: Is it distinct from episodic and semantic memory? An electrophysiological study of memory for autobiographical facts and repeated events in honor of Shlomo Bentin.

    PubMed

    Renoult, Louis; Tanguay, Annick; Beaudry, Myriam; Tavakoli, Paniz; Rabipour, Sheida; Campbell, Kenneth; Moscovitch, Morris; Levine, Brian; Davidson, Patrick S R

    2016-03-01

    Declarative memory is thought to consist of two independent systems: episodic and semantic. Episodic memory represents personal and contextually unique events, while semantic memory represents culturally-shared, acontextual factual knowledge. Personal semantics refers to aspects of declarative memory that appear to fall somewhere in between the extremes of episodic and semantic. Examples include autobiographical knowledge and memories of repeated personal events. These two aspects of personal semantics have been studied little and rarely compared to both semantic and episodic memory. We recorded the event-related potentials (ERPs) of 27 healthy participants while they verified the veracity of sentences probing four types of questions: general (i.e., semantic) facts, autobiographical facts, repeated events, and unique (i.e., episodic) events. Behavioral results showed equivalent reaction times in all 4 conditions. True sentences were verified faster than false sentences, except for unique events for which no significant difference was observed. Electrophysiological results showed that the N400 (which is classically associated with retrieval from semantic memory) was maximal for general facts and the LPC (which is classically associated with retrieval from episodic memory) was maximal for unique events. For both ERP components, the two personal semantic conditions (i.e., autobiographical facts and repeated events) systematically differed from semantic memory. In addition, N400 amplitudes also differentiated autobiographical facts from unique events. Autobiographical facts and repeated events did not differ significantly from each other but their corresponding scalp distributions differed from those associated with general facts. Our results suggest that the neural correlates of personal semantics can be distinguished from those of semantic and episodic memory, and may provide clues as to how unique events are transformed to semantic memory. Copyright © 2015 Elsevier Ltd. All rights reserved.

  5. RC64, a Rad-Hard Many-Core High- Performance DSP for Space Applications

    NASA Astrophysics Data System (ADS)

    Ginosar, Ran; Aviely, Peleg; Gellis, Hagay; Liran, Tuvia; Israeli, Tsvika; Nesher, Roy; Lange, Fredy; Dobkin, Reuven; Meirov, Henri; Reznik, Dror

    2015-09-01

    RC64, a novel rad-hard 64-core signal processing chip, targets DSP performance of 75 GMACs (16bit), 150 GOPS and 38 single precision GFLOPS while dissipating less than 10 Watts. RC64 integrates advanced DSP cores with a multi-bank shared memory and a hardware scheduler, also supporting DDR2/3 memory and twelve 3.125 Gbps full duplex high speed serial links using SpaceFibre and other protocols. The programming model employs sequential fine-grain tasks and a separate task map to define task dependencies. RC64 is implemented as a 300 MHz integrated circuit on a 65nm CMOS technology, assembled in hermetically sealed ceramic CCGA624 package and qualified to the highest space standards.
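
    The programming model described here (sequential fine-grain tasks plus a task map of dependencies) can be mimicked in software with dependence counters. The C sketch below is an illustrative software analogue only; RC64 resolves the task map in its hardware scheduler, and every name below is invented.

    ```c
    #include <stdio.h>

    #define NTASKS 4

    typedef void (*task_fn)(void);

    static void t0(void) { puts("load input"); }
    static void t1(void) { puts("filter A"); }
    static void t2(void) { puts("filter B"); }
    static void t3(void) { puts("combine"); }

    static task_fn tasks[NTASKS] = { t0, t1, t2, t3 };
    /* Task map: deps[i] counts unfinished predecessors of task i; succ[i]
     * lists the tasks that i enables, terminated by -1. */
    static int deps[NTASKS] = { 0, 1, 1, 2 };
    static int succ[NTASKS][NTASKS] = { { 1, 2, -1 }, { 3, -1 }, { 3, -1 }, { -1 } };

    int main(void)
    {
        int done = 0;
        while (done < NTASKS)
            for (int i = 0; i < NTASKS; i++)
                if (deps[i] == 0) {
                    tasks[i]();                /* task is ready: run it */
                    deps[i] = -1;              /* mark as executed */
                    done++;
                    for (int k = 0; succ[i][k] >= 0; k++)
                        deps[succ[i][k]]--;    /* release dependents */
                }
        return 0;
    }
    ```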

  6. RC64, a Rad-Hard Many-Core High-Performance DSP for Space Applications

    NASA Astrophysics Data System (ADS)

    Ginosar, Ran; Aviely, Peleg; Liran, Tuvia; Alon, Dov; Mandler, Alberto; Lange, Fredy; Dobkin, Reuven; Goldberg, Miki

    2014-08-01

    RC64, a novel rad-hard 64-core signal processing chip, targets DSP performance of 75 GMACs (16bit), 150 GOPS and 20 single precision GFLOPS while dissipating less than 10 Watts. RC64 integrates advanced DSP cores with a multi-bank shared memory and a hardware scheduler, also supporting DDR2/3 memory and twelve 2.5 Gbps full duplex high speed serial links using SpaceFibre and other protocols. The programming model employs sequential fine-grain tasks and a separate task map to define task dependencies. RC64 is implemented as a 300 MHz integrated circuit on a 65nm CMOS technology, assembled in hermetically sealed ceramic CCGA624 package and qualified to the highest space standards.

  7. Kanerva's sparse distributed memory: An associative memory algorithm well-suited to the Connection Machine

    NASA Technical Reports Server (NTRS)

    Rogers, David

    1988-01-01

    The advent of the Connection Machine profoundly changes the world of supercomputers. The highly nontraditional architecture makes possible the exploration of algorithms that were impractical for standard Von Neumann architectures. Sparse distributed memory (SDM) is an example of such an algorithm. Sparse distributed memory is a particularly simple and elegant formulation for an associative memory. The foundations for sparse distributed memory are described, and some simple examples of using the memory are presented. The relationship of sparse distributed memory to three important computational systems is shown: random-access memory, neural networks, and the cerebellum of the brain. Finally, the implementation of the algorithm for sparse distributed memory on the Connection Machine is discussed.
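
    The core read/write operations of sparse distributed memory are compact enough to sketch directly. The C program below stores and recalls a bit pattern through up/down counters at randomly addressed hard locations activated within a Hamming radius, in the spirit of Kanerva's formulation; the sizes are toy values, and __builtin_popcountll assumes GCC/Clang.

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    #define ADDR_BITS 64        /* address/word length */
    #define NHARD 1024          /* number of hard locations */
    #define RADIUS 24           /* activation radius (Hamming distance) */

    static unsigned long long hard_addr[NHARD]; /* random hard-location addresses */
    static int counters[NHARD][ADDR_BITS];      /* one up/down counter per bit */

    static int hamming(unsigned long long a, unsigned long long b)
    {
        return __builtin_popcountll(a ^ b);
    }

    static void sdm_write(unsigned long long addr, unsigned long long word)
    {
        for (int h = 0; h < NHARD; h++)
            if (hamming(addr, hard_addr[h]) <= RADIUS)
                for (int b = 0; b < ADDR_BITS; b++)
                    counters[h][b] += (word >> b & 1) ? 1 : -1;
    }

    static unsigned long long sdm_read(unsigned long long addr)
    {
        int sum[ADDR_BITS] = { 0 };
        unsigned long long out = 0;
        for (int h = 0; h < NHARD; h++)
            if (hamming(addr, hard_addr[h]) <= RADIUS)
                for (int b = 0; b < ADDR_BITS; b++)
                    sum[b] += counters[h][b];
        for (int b = 0; b < ADDR_BITS; b++)
            if (sum[b] > 0)
                out |= 1ULL << b;   /* majority vote per bit */
        return out;
    }

    int main(void)
    {
        srand(1);
        for (int h = 0; h < NHARD; h++)         /* uniformly random addresses */
            for (int b = 0; b < ADDR_BITS; b++)
                if (rand() & 1)
                    hard_addr[h] |= 1ULL << b;

        unsigned long long pattern = 0xDEADBEEFCAFEF00DULL;
        sdm_write(pattern, pattern);            /* autoassociative store */
        printf("recalled: %llx\n", sdm_read(pattern));
        return 0;
    }
    ```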

  8. Welcoming nora: a family event.

    PubMed

    Walsh, Allison J; Walsh, Paul R; Walsh, Jane M; Walsh, Gavin T

    2011-01-01

    In this column, Allison and Paul Walsh share the story of the birth of Nora, their third baby and their second child to be born at home. Allison and Paul share their individual memories of labor and birth. But their story is only part of the story of Nora's birth. Nora's birth was a family event, with Allison and Paul's other children very much part of the experience. Jane and Gavin share their own memories of their baby sister's birth.

  9. Fast distributed large-pixel-count hologram computation using a GPU cluster.

    PubMed

    Pan, Yuechao; Xu, Xuewu; Liang, Xinan

    2013-09-10

    Large-pixel-count holograms are an essential part of large-size holographic three-dimensional (3D) display, but the generation of such holograms is computationally demanding. In order to address this issue, we have built a graphics processing unit (GPU) cluster with 32.5 Tflop/s computing power and implemented distributed hologram computation on it with speed improvement techniques, such as shared memory on GPU, GPU level adaptive load balancing, and node level load distribution. Using these speed improvement techniques on the GPU cluster, we have achieved 71.4 times computation speed increase for 186M-pixel holograms. Furthermore, we have used the approaches of diffraction limits and subdivision of holograms to overcome the GPU memory limit in computing large-pixel-count holograms. 745M-pixel and 1.80G-pixel holograms were computed in 343 and 3326 s, respectively, for more than 2 million object points with RGB colors. Color 3D objects with 1.02M points were successfully reconstructed from 186M-pixel hologram computed in 8.82 s with all the above three speed improvement techniques. It is shown that distributed hologram computation using a GPU cluster is a promising approach to increase the computation speed of large-pixel-count holograms for large size holographic display.

  10. Colouring in the Blanks: Memory Drawings of the 1990 Kuwait Invasion

    ERIC Educational Resources Information Center

    Pepin-Wakefield, Yvonne

    2009-01-01

    This study used drawing tasks to examine the similarities and differences between females and males who shared a collective traumatic event in early childhood. Could these childhood memories be recorded, measured, and compared for gender differences in drawings by young adults who had shared a similar experience as children? Exploration of this…

  11. Functions of Memory Sharing and Mother-Child Reminiscing Behaviors: Individual and Cultural Variations

    ERIC Educational Resources Information Center

    Kulkofsky, Sarah; Wang, Qi; Koh, Jessie Bee Kim

    2009-01-01

    This study examined maternal beliefs about the functions of memory sharing and the relations between these beliefs and mother-child reminiscing behaviors in a cross-cultural context. Sixty-three European American and 47 Chinese mothers completed an open-ended questionnaire concerning their beliefs about the functions of parent-child memory…

  12. Block-Parallel Data Analysis with DIY2

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morozov, Dmitriy; Peterka, Tom

    DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial, parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.
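
    The key abstraction, iteration over blocks through a runtime-owned loop, can be suggested in a few lines of C (DIY2 itself is a C++ library, so this is an analogy with invented names). Because user code sees one block at a time via a callback, the runtime is free to reorder blocks, run them on threads, or page them between disk and DRAM without changing user code.

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    struct block {
        int gid;            /* global block id */
        int n;              /* elements in this block */
        double *data;       /* block payload (could live out-of-core) */
    };

    typedef void (*block_fn)(struct block *);

    /* The runtime hook: here a serial loop, but it could dispatch blocks to
     * threads or stage them from disk one at a time. */
    static void foreach_block(struct block *blocks, int nblocks, block_fn f)
    {
        for (int i = 0; i < nblocks; i++)
            f(&blocks[i]);
    }

    static void local_sum(struct block *b)
    {
        double s = 0;
        for (int i = 0; i < b->n; i++)
            s += b->data[i];
        printf("block %d: sum = %g\n", b->gid, s);
    }

    int main(void)
    {
        enum { NBLOCKS = 4, N = 8 };
        struct block blocks[NBLOCKS];
        for (int g = 0; g < NBLOCKS; g++) {   /* decompose data into blocks */
            blocks[g].gid = g;
            blocks[g].n = N;
            blocks[g].data = malloc(N * sizeof(double));
            for (int i = 0; i < N; i++)
                blocks[g].data[i] = g + i * 0.1;
        }
        foreach_block(blocks, NBLOCKS, local_sum);
        for (int g = 0; g < NBLOCKS; g++)
            free(blocks[g].data);
        return 0;
    }
    ```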

  13. Unleashing the Power of Distributed CPU/GPU Architectures: Massive Astronomical Data Analysis and Visualization Case Study

    NASA Astrophysics Data System (ADS)

    Hassan, A. H.; Fluke, C. J.; Barnes, D. G.

    2012-09-01

    Upcoming and future astronomy research facilities will systematically generate terabyte-sized data sets moving astronomy into the Petascale data era. While such facilities will provide astronomers with unprecedented levels of accuracy and coverage, the increases in dataset size and dimensionality will pose serious computational challenges for many current astronomy data analysis and visualization tools. With such data sizes, even simple data analysis tasks (e.g. calculating a histogram or computing data minimum/maximum) may not be achievable without access to a supercomputing facility. To effectively handle such dataset sizes, which exceed today's single machine memory and processing limits, we present a framework that exploits the distributed power of GPUs and many-core CPUs, with a goal of providing data analysis and visualizing tasks as a service for astronomers. By mixing shared and distributed memory architectures, our framework effectively utilizes the underlying hardware infrastructure handling both batched and real-time data analysis and visualization tasks. Offering such functionality as a service in a “software as a service” manner will reduce the total cost of ownership, provide an easy to use tool to the wider astronomical community, and enable a more optimized utilization of the underlying hardware infrastructure.

  14. Stillbirth and stigma: the spoiling and repair of multiple social identities.

    PubMed

    Brierley-Jones, Lyn; Crawley, Rosalind; Lomax, Samantha; Ayers, Susan

    This study investigated mothers' experiences surrounding stillbirth in the United Kingdom, their memory making and sharing opportunities, and the effect these opportunities had on them. Qualitative data were generated from free text responses to open-ended questions. Thematic content analysis revealed that "stigma" was experienced by most women and Goffman's (1963) work on stigma was subsequently used as an analytical framework. Results suggest that stillbirth can spoil the identities of "patient," "mother," and "full citizen." Stigma was reported as arising from interactions with professionals, family, friends, work colleagues, and even casual acquaintances. Stillbirth produces common learning experiences often requiring "identity work" (Murphy, 2012). Memory making and sharing may be important in this work and further research is needed. Stigma can reduce the memory sharing opportunities for women after stillbirth and this may explain some of the differential mental health effects of memory making after stillbirth that is documented in the literature.

  15. Keep your bias to yourself: How deliberating with differently biased others affects mock-jurors' guilt decisions, perceptions of the defendant, memories, and evidence interpretation.

    PubMed

    Ruva, Christine L; Guenther, Christina C

    2017-10-01

    This experiment explored how mock-jurors' (N = 648) guilt decisions, perceptions of the defendant, memories, and evidence interpretation varied as a function of jury type and pretrial publicity (PTP); utilizing a 2 (jury type: pure-PTP vs. mixed-PTP) × 3 (PTP: defendant, victim, and irrelevant) factorial design. Mock-juries (N = 126) were composed of jurors exposed to the same type of PTP (pure-PTP; e.g., defendant-PTP) or different types of PTP (mixed-PTP; e.g., half exposed to defendant-PTP and half to irrelevant-PTP). Before deliberations jurors exposed to defendant-PTP were most likely to vote guilty; while those exposed to victim-PTP were least likely. After deliberations, jury type and PTP affected jurors' guilt decisions. Specifically, jurors deliberating on pure-PTP juries had verdict distributions that closely resembled the predeliberation distributions. The verdict distributions of jurors on mixed-PTP juries suggested that jurors were influenced by those they deliberated with. Jurors not exposed to PTP appeared to incorporate bias from PTP-exposed jurors. Only PTP had significant effects on postdeliberation measures of memory and evidence interpretation. Mediation analyses revealed that evidence interpretation and defendant credibility assessments mediated the effect of PTP on guilt ratings. Taken together these findings suggest that during deliberations PTP bias can spread to jurors not previously exposed to PTP. In addition, juries composed of jurors exposed to different PTP slants, as opposed to a single PTP slant, can result in less biased decisions. Finally, deliberating with others who do not share similar biases may have little, if any, impact on biased evidence interpretation or memory errors. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  16. Brain Information Sharing During Visual Short-Term Memory Binding Yields a Memory Biomarker for Familial Alzheimer's Disease.

    PubMed

    Parra, Mario A; Mikulan, Ezequiel; Trujillo, Natalia; Sala, Sergio Della; Lopera, Francisco; Manes, Facundo; Starr, John; Ibanez, Agustin

    2017-01-01

    Alzheimer's disease (AD) has been described as a disconnection syndrome which disrupts both brain information sharing and memory binding functions. The extent to which these two phenotypic expressions share pathophysiological mechanisms remains unknown. We aimed to unveil the electrophysiological correlates of integrative memory impairments in AD, towards new memory biomarkers for its prodromal stages. Patients with 100% risk of familial AD (FAD) and healthy controls underwent assessment with the Visual Short-Term Memory binding test (VSTMBT) while we recorded their EEG. We applied a novel brain connectivity method (Weighted Symbolic Mutual Information) to the EEG data. Patients showed significant deficits during the VSTMBT. A reduction of brain connectivity was observed during resting as well as during correct VSTM binding, particularly over frontal and posterior regions. An increase of connectivity was found during VSTM binding performance over central regions. While decreased connectivity was found in cases in more advanced stages of FAD, increased brain connectivity appeared in cases in earlier stages. Such altered patterns of task-related connectivity were found in 89% of the assessed patients. VSTM binding deficits in the prodromal stages of FAD are associated with altered patterns of brain connectivity, confirming the link between integrative memory deficits and impaired brain information sharing in prodromal FAD. While significant loss of brain connectivity seems to be a feature of the advanced stages of FAD, increased brain connectivity characterizes its earlier stages. These findings are discussed in the light of recent proposals about the earliest pathophysiological mechanisms of AD and their clinical expression. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  17. The N400 reveals how personal semantics is processed: Insights into the nature and organization of self-knowledge

    PubMed Central

    Federmeier, Kara D.

    2017-01-01

    There is growing recognition that some important forms of long-term memory are difficult to classify into one of the well-studied memory subtypes. One example is personal semantics. Like the episodes that are stored as part of one’s autobiography, personal semantics is linked to an individual, yet, like general semantic memory, it is detached from a specific encoding context. Access to general semantics elicits an electrophysiological response known as the N400, which has been characterized across three decades of research; surprisingly, this response has not been fully examined in the context of personal semantics. In this study, we assessed responses to congruent and incongruent statements about people’s own, personal preferences. We found that access to personal preferences elicited N400 responses, with congruency effects that were similar in latency and distribution to those for general semantic statements elicited from the same participants. These results suggest that the processing of personal and general semantics shares important functional and neurobiological features. PMID:26825011

  18. A distributed model: redefining a robust research subject advocacy program at the Harvard Clinical and Translational Science Center.

    PubMed

    Winkler, Sabune J; Cagliero, Enrico; Witte, Elizabeth; Bierer, Barbara E

    2014-08-01

    The Harvard Clinical and Translational Science Center ("Harvard Catalyst") Research Subject Advocacy (RSA) Program has reengineered subject advocacy, distributing the delivery of advocacy functions through a multi-institutional, central platform rather than vesting these roles and responsibilities in a single individual functioning as a subject advocate. The program is process-oriented and output-driven, drawing on the strengths of participating institutions to engage local stakeholders both in the protection of research subjects and in advocacy for subjects' rights. The program engages stakeholder communities in the collaborative development and distributed delivery of accessible and applicable educational programming and resources. The Harvard Catalyst RSA Program identifies, develops, and supports the sharing and distribution of expertise, education, and resources for the benefit of all institutions, with a particular focus on the frontline: research subjects, researchers, research coordinators, and research nurses. © 2014 Wiley Periodicals, Inc.

  19. Advancing Translational Space Research Through Biospecimen Sharing: Amplified Impact of Studies Utilizing Analogue Space Platforms

    NASA Technical Reports Server (NTRS)

    Staten, B.; Moyer, E.; Vizir, V.; Gompf, H.; Hoban-Higgins, T.; Lewis, L.; Ronca, A.; Fuller, C. A.

    2016-01-01

    Biospecimen Sharing Programs (BSPs) have been organized by NASA Ames Research Center since the 1960s with the goal of maximizing utilization and scientific return from rare, complex and costly spaceflight experiments. BSPs involve acquiring otherwise unused biological specimens from primary space research experiments for distribution to secondary experiments. Here we describe a collaboration leveraging Ames expertise in biospecimen sharing to magnify the scientific impact of research informing astronaut health funded by the NASA Human Research Program (HRP) Human Health Countermeasures (HHC) Element. The concept expands biospecimen sharing to one-off ground-based studies utilizing analogue space platforms (e.g., Hindlimb Unloading (HLU), Artificial Gravity) for rodent experiments, thereby significantly broadening the range of research opportunities with translational relevance for protecting human health in space and on Earth.

  20. Over-Distribution in Source Memory

    PubMed Central

    Brainerd, C. J.; Reyna, V. F.; Holliday, R. E.; Nakamura, K.

    2012-01-01

    Semantic false memories are confounded with a second type of error, over-distribution, in which items are attributed to contradictory episodic states. Over-distribution errors have proved to be more common than false memories when the two are disentangled. We investigated whether over-distribution is prevalent in another classic false memory paradigm: source monitoring. It is. Conventional false memory responses (source misattributions) were predominantly over-distribution errors, but unlike semantic false memory, over-distribution also accounted for more than half of true memory responses (correct source attributions). Experimental control of over-distribution was achieved via a series of manipulations that affected either recollection of contextual details or item memory (concreteness, frequency, list-order, number of presentation contexts, and individual differences in verbatim memory). A theoretical model (conjoint process dissociation) was used to analyze the data; it predicts that (a) over-distribution is directly proportional to item memory but inversely proportional to recollection and (b) item memory is not a necessary precondition for recollection of contextual details. The results were consistent with both predictions. PMID:21942494

  1. Decomposing the relationship between cognitive functioning and self-referent memory beliefs in older adulthood: What’s memory got to do with it?

    PubMed Central

    Payne, Brennan R.; Gross, Alden L.; Hill, Patrick L.; Parisi, Jeanine M.; Rebok, George W.; Stine-Morrow, Elizabeth A. L.

    2018-01-01

    With advancing age, episodic memory performance shows marked declines along with concurrent reports of lower subjective memory beliefs. Given that normative age-related declines in episodic memory co-occur with declines in other cognitive domains, we examined the relationship between memory beliefs and multiple domains of cognitive functioning. Confirmatory bi-factor structural equation models were used to parse the shared and independent variance among factors representing episodic memory, psychomotor speed, and executive reasoning in one large cohort study (Senior Odyssey, N = 462), and replicated using another large cohort of healthy older adults (ACTIVE, N = 2,802). Accounting for a general fluid cognitive functioning factor (comprised of the shared variance among measures of episodic memory, speed, and reasoning) attenuated the relationship between objective memory performance and subjective memory beliefs in both samples. Moreover, the general cognitive functioning factor was the strongest predictor of memory beliefs in both samples. These findings are consistent with the notion that dispositional memory beliefs may reflect perceptions of cognition more broadly. This may be one reason why memory beliefs have broad predictive validity for interventions that target fluid cognitive ability. PMID:27685541

  2. Decomposing the relationship between cognitive functioning and self-referent memory beliefs in older adulthood: what's memory got to do with it?

    PubMed

    Payne, Brennan R; Gross, Alden L; Hill, Patrick L; Parisi, Jeanine M; Rebok, George W; Stine-Morrow, Elizabeth A L

    2017-07-01

    With advancing age, episodic memory performance shows marked declines along with concurrent reports of lower subjective memory beliefs. Given that normative age-related declines in episodic memory co-occur with declines in other cognitive domains, we examined the relationship between memory beliefs and multiple domains of cognitive functioning. Confirmatory bi-factor structural equation models were used to parse the shared and independent variance among factors representing episodic memory, psychomotor speed, and executive reasoning in one large cohort study (Senior Odyssey, N = 462), and replicated using another large cohort of healthy older adults (ACTIVE, N = 2802). Accounting for a general fluid cognitive functioning factor (comprised of the shared variance among measures of episodic memory, speed, and reasoning) attenuated the relationship between objective memory performance and subjective memory beliefs in both samples. Moreover, the general cognitive functioning factor was the strongest predictor of memory beliefs in both samples. These findings are consistent with the notion that dispositional memory beliefs may reflect perceptions of cognition more broadly. This may be one reason why memory beliefs have broad predictive validity for interventions that target fluid cognitive ability.

  3. Common neural substrates for visual working memory and attention.

    PubMed

    Mayer, Jutta S; Bittner, Robert A; Nikolić, Danko; Bledowski, Christoph; Goebel, Rainer; Linden, David E J

    2007-06-01

    Humans are severely limited in their ability to memorize visual information over short periods of time. Selective attention has been implicated as a limiting factor. Here we used functional magnetic resonance imaging to test the hypothesis that this limitation is due to common neural resources shared by visual working memory (WM) and selective attention. We combined visual search and delayed discrimination of complex objects and independently modulated the demands on selective attention and WM encoding. Participants were presented with a search array and performed easy or difficult visual search in order to encode one or three complex objects into visual WM. Overlapping activation for attention-demanding visual search and WM encoding was observed in distributed posterior and frontal regions. In the right prefrontal cortex and bilateral insula, blood oxygen-level-dependent activation additively increased with increased WM load and attentional demand. Conversely, several visual, parietal and premotor areas showed overlapping activation for the two task components and were severely reduced in their WM load response under the condition with high attentional demand. Regions in the left prefrontal cortex were selectively responsive to WM load. Areas selectively responsive to high attentional demand were found within the right prefrontal and bilateral occipital cortex. These results indicate that encoding into visual WM and visual selective attention rely to a high degree on common neural resources. We propose that competition for resources shared by visual attention and WM encoding can limit processing capabilities in distributed posterior brain regions.

  4. Method for prefetching non-contiguous data structures

    DOEpatents

    Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Brewster, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

    2009-05-05

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
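
    The pointer-per-line idea can be emulated in software to see its shape. In the C sketch below, each "memory line" carries a pointer to the line to fetch next, and a traversal issues a software prefetch on it; __builtin_prefetch assumes GCC/Clang, and the layout is invented rather than taken from the patent.

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    #define LINE_DATA 56   /* payload bytes; with the 8-byte pointer this makes
                              a 64-byte "line" on typical 64-bit targets */

    struct mem_line {
        struct mem_line *next_hint;    /* pointer stored alongside the data */
        unsigned char data[LINE_DATA];
    };

    static long walk(struct mem_line *l, int steps)
    {
        long sum = 0;
        while (steps-- && l) {
            if (l->next_hint)          /* start fetching the next line early */
                __builtin_prefetch(l->next_hint, 0, 3);
            sum += l->data[0];         /* stand-in for real work on the line */
            l = l->next_hint;
        }
        return sum;
    }

    int main(void)
    {
        enum { N = 1000 };
        struct mem_line *lines = calloc(N, sizeof *lines);
        if (!lines)
            return 1;
        /* Build a scattered but repetitive traversal order. */
        for (int i = 0; i < N; i++) {
            lines[i].data[0] = (unsigned char)i;
            lines[i].next_hint = &lines[(i * 31 + 17) % N];
        }
        printf("sum = %ld\n", walk(&lines[0], N));  /* bounded by step count */
        free(lines);
        return 0;
    }
    ```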

  5. Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

    NASA Astrophysics Data System (ADS)

    Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

    2017-05-01

    In a heterogeneous multi-core architecture, CPU and GPU processors are integrated on the same chip, which poses a new challenge for last-level cache management. In this architecture, CPU applications and GPU applications execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics and thus differ in their sensitivity to last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. In contrast, GPU applications can tolerate increased memory access latency when there is sufficient thread-level parallelism. Taking into account the memory latency tolerance of GPU programs, this paper presents a method that lets GPU applications access memory directly, leaving more LLC space for CPU applications; this improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache sensitive and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.

  6. A shared resource between declarative memory and motor memory.

    PubMed

    Keisler, Aysha; Shadmehr, Reza

    2010-11-03

    The neural systems that support motor adaptation in humans are thought to be distinct from those that support the declarative system. Yet, during motor adaptation changes in motor commands are supported by a fast adaptive process that has important properties (rapid learning, fast decay) that are usually associated with the declarative system. The fast process can be contrasted to a slow adaptive process that also supports motor memory, but learns gradually and shows resistance to forgetting. Here we show that after people stop performing a motor task, the fast motor memory can be disrupted by a task that engages declarative memory, but the slow motor memory is immune from this interference. Furthermore, we find that the fast/declarative component plays a major role in the consolidation of the slow motor memory. Because of the competitive nature of declarative and nondeclarative memory during consolidation, impairment of the fast/declarative component leads to improvements in the slow/nondeclarative component. Therefore, the fast process that supports formation of motor memory is not only neurally distinct from the slow process, but it shares critical resources with the declarative memory system.

  7. A shared resource between declarative memory and motor memory

    PubMed Central

    Keisler, Aysha; Shadmehr, Reza

    2010-01-01

    The neural systems that support motor adaptation in humans are thought to be distinct from those that support the declarative system. Yet, during motor adaptation changes in motor commands are supported by a fast adaptive process that has important properties (rapid learning, fast decay) that are usually associated with the declarative system. The fast process can be contrasted to a slow adaptive process that also supports motor memory, but learns gradually and shows resistance to forgetting. Here we show that after people stop performing a motor task, the fast motor memory can be disrupted by a task that engages declarative memory, but the slow motor memory is immune from this interference. Furthermore, we find that the fast/declarative component plays a major role in the consolidation of the slow motor memory. Because of the competitive nature of declarative and non-declarative memory during consolidation, impairment of the fast/declarative component leads to improvements in the slow/non-declarative component. Therefore, the fast process that supports formation of motor memory is not only neurally distinct from the slow process, but it shares critical resources with the declarative memory system. PMID:21048140

  8. Optimizing ROOT’s Performance Using C++ Modules

    NASA Astrophysics Data System (ADS)

    Vassilev, Vassil

    2017-10-01

    ROOT comes with a C++ compliant interpreter, cling. Cling needs to understand the content of the libraries in order to interact with them. Exposing the full shared library descriptors to the interpreter at runtime translates into an increased memory footprint. ROOT’s exploratory programming concepts allow implicit and explicit runtime shared library loading, which requires the interpreter to load the library descriptor. Re-parsing of descriptors’ content has a noticeable effect on runtime performance. The present state-of-the-art lazy parsing technique brings the runtime performance to reasonable levels but proves to be fragile and can introduce correctness issues. An elegant solution is to load information from the descriptor lazily and in a non-recursive way. The LLVM community advances its C++ Modules technology, providing an I/O-efficient, on-disk representation capable of reducing build times and peak memory usage. The feature is standardized as a C++ technical specification. C++ Modules are a flexible concept which can be employed to match CMS and other experiments’ requirement for ROOT: to optimize both runtime memory usage and performance. Cling technically “inherits” the feature; however, tweaking it to ROOT scale and beyond is a complex endeavor. The paper discusses the status of C++ Modules in the context of ROOT, supported by a few preliminary performance results. It shows a step-by-step migration plan and describes potential challenges which could appear.

  9. ‘tripleint_cc’: A program for 2-centre variational leptonic Coulomb potential matrix elements using Hylleraas-type trial functions, with a performance optimization study

    NASA Astrophysics Data System (ADS)

    Plummer, M.; Armour, E. A. G.; Todd, A. C.; Franklin, C. P.; Cooper, J. N.

    2009-12-01

    We present a program used to calculate intricate three-particle integrals for variational calculations of solutions to the leptonic Schrödinger equation with two nuclear centres in which inter-leptonic distances (electron-electron and positron-electron) are included directly in the trial functions. The program has been used so far in calculations of He-H¯ interactions and positron-H2 scattering; however, the precisely defined integrals are applicable to other situations. We include a summary discussion of how the program has been optimized from a 'legacy'-type code to a more modern high-performance code with a performance improvement factor of up to 1000. Program summary. Program title: tripleint.cc Catalogue identifier: AEEV_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEV_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 12 829 No. of bytes in distributed program, including test data, etc.: 91 798 Distribution format: tar.gz Programming language: Fortran 95 (fixed format) Computer: Modern PC (tested on AMD processor) [1], IBM Power5 [2], Cray XT4 [3], or similar Operating system: Red Hat Linux [1], IBM AIX [2], UNICOS [3] Has the code been vectorized or parallelized?: Serial (multi-core shared memory may be needed for some large jobs) RAM: Dependent on parameter sizes and option to use intermediate I/O. Estimates for practical use: 0.5-2 GBytes (with intermediate I/O); 1-4 GBytes (all-memory: the preferred option). Classification: 2.4, 2.6, 2.7, 2.9, 16.5, 16.10, 20 Nature of problem: The 'tripleint.cc' code evaluates three-particle integrals needed in certain variational (in particular: Rayleigh-Ritz and generalized-Kohn) matrix elements for solution of the Schrödinger equation with two fixed centres (the solutions may then be used in subsequent dynamic nuclear calculations). Specifically, the integrals are defined by Eq. (16) in the main text and contain terms proportional to r_ij r_ik / r_jk, i≠j, i≠k, j≠k, with r_ij the distance between leptons i and j. The article also briefly describes the performance optimizations used to increase the speed of evaluation of the integrals enough to allow detailed testing and mapping of the effect of varying non-linear parameters in the variational trial functions. Solution method: Each integral is solved using prolate spheroidal coordinates and series expansions (with cut-offs) of the many-lepton expressions. 1-d integrals and sub-integrals are solved analytically by various means (the program automatically chooses the most accurate of the available methods for each set of parameters and function arguments), while two of the three integrations over the prolate spheroidal coordinates λ are carried out numerically. Many similar integrals with identical non-linear variational parameters may be calculated with one call of the code. Restrictions: There are limits to the number of points for the numerical integrations, to the cut-off variable itaumax for the many-lepton series expansions, and to the maximum powers of Slater-like input functions. For runs near the limit of the cut-off variable and with certain small-magnitude values of variational non-linear parameters, the code can require large amounts of memory (an option using some intermediate I/O is included to offset this).
Unusual features: In addition to the program, we also present a summary description of the techniques and ideology used to optimize the code, together with accuracy tests and indications of performance improvement. Running time: The test runs take 1-15 minutes on HPCx [2] as indicated in Section 5 of the main text. A practical run with 729 integrals, 40 quadrature points per dimension and itaumax = 8 took 150 minutes on a PC (e.g., [1]); a similar run with 'medium' accuracy, e.g. for parameter optimization (see Section 2 of the main text), with 30 points per dimension and itaumax = 6 took 35 minutes. References: [1] PC: memory 2.72 GB, CPU AMD Opteron 246 dual-core, 2×2 GHz, OS GNU/Linux, kernel Linux 2.6.9-34.0.2.ELsmp. [2] HPCx, IBM eServer 575 running IBM AIX, http://www.hpcx.ac.uk/ (visited May 2009). [3] HECToR, CRAY XT4 running UNICOS/lc, http://www.hector.ac.uk/ (visited May 2009).

  10. VIRTUAL FRAME BUFFER INTERFACE

    NASA Technical Reports Server (NTRS)

    Wolfe, T. L.

    1994-01-01

    Large image processing systems use multiple frame buffers with differing architectures and vendor supplied user interfaces. This variety of architectures and interfaces creates software development, maintenance, and portability problems for application programs. The Virtual Frame Buffer Interface program makes all frame buffers appear as a generic frame buffer with a specified set of characteristics, allowing programmers to write code which will run unmodified on all supported hardware. The Virtual Frame Buffer Interface converts generic commands to actual device commands. The virtual frame buffer consists of a definition of capabilities and FORTRAN subroutines that are called by application programs. The virtual frame buffer routines may be treated as subroutines, logical functions, or integer functions by the application program. Routines are included that allocate and manage hardware resources such as frame buffers, monitors, video switches, trackballs, tablets and joysticks; access image memory planes; and perform alphanumeric font or text generation. The subroutines for the various "real" frame buffers are in separate VAX/VMS shared libraries allowing modification, correction or enhancement of the virtual interface without affecting application programs. The Virtual Frame Buffer Interface program was developed in FORTRAN 77 for a DEC VAX 11/780 or a DEC VAX 11/750 under VMS 4.X. It supports ADAGE IK3000, DEANZA IP8500, Low Resolution RAMTEK 9460, and High Resolution RAMTEK 9460 Frame Buffers. It has a central memory requirement of approximately 150K. This program was developed in 1985.
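
    The "generic frame buffer" idea is essentially dispatch through a device-independent operations table. The C sketch below illustrates it with function pointers; the original is FORTRAN 77 with per-device shared libraries, so the signatures and the single RAMTEK backend here are invented for illustration.

    ```c
    #include <stdio.h>

    /* Generic frame buffer interface: every supported device fills this in. */
    struct framebuffer_ops {
        const char *name;
        int  (*open_fb)(void);
        void (*write_pixel)(int x, int y, unsigned value);
        void (*draw_text)(int x, int y, const char *s);
    };

    /* One concrete backend; a real system would also register ADAGE,
     * DEANZA, and other devices behind the same table. */
    static int ramtek_open(void) { puts("RAMTEK 9460 opened"); return 0; }
    static void ramtek_pixel(int x, int y, unsigned v)
    {
        printf("pixel (%d,%d) <- %#x\n", x, y, v);
    }
    static void ramtek_text(int x, int y, const char *s)
    {
        printf("text at (%d,%d): %s\n", x, y, s);
    }

    static const struct framebuffer_ops ramtek9460 = {
        "RAMTEK 9460", ramtek_open, ramtek_pixel, ramtek_text
    };

    /* Application code sees only the generic interface, so it runs
     * unmodified on any backend plugged in here. */
    int main(void)
    {
        const struct framebuffer_ops *fb = &ramtek9460;
        fb->open_fb();
        fb->write_pixel(10, 20, 0xff);
        fb->draw_text(0, 0, "generic frame buffer");
        return 0;
    }
    ```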

  11. Simulation of n-qubit quantum systems. I. Quantum registers and quantum gates

    NASA Astrophysics Data System (ADS)

    Radtke, T.; Fritzsche, S.

    2005-12-01

    During recent years, quantum computations and the study of n-qubit quantum systems have attracted a lot of interest, both in theory and experiment. Apart from the promise of performing quantum computations, however, these investigations also revealed a great deal of difficulties which still need to be solved in practice. In quantum computing, unitary and non-unitary quantum operations act on a given set of qubits to form (entangled) states, in which the information is encoded by the overall system often referred to as quantum registers. To facilitate the simulation of such n-qubit quantum systems, we present the FEYNMAN program to provide all necessary tools in order to define and to deal with quantum registers and quantum operations. Although the present version of the program is restricted to unitary transformations, it equally supports—whenever possible—the representation of the quantum registers both in terms of their state vectors and density matrices. In addition to the composition of two or more quantum registers, moreover, the program also supports their decomposition into various parts by applying the partial trace operation and the concept of the reduced density matrix. Using an interactive design within the framework of MAPLE, therefore, we expect the FEYNMAN program to be helpful not only for teaching the basic elements of quantum computing but also for studying their physical realization in the future. Program summary. Title of program: FEYNMAN Catalogue number: ADWE Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADWE Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland Licensing provisions: None Computers for which the program is designed: All computers with a license of the computer algebra system MAPLE [Maple is a registered trademark of Waterloo Maple Inc.] Operating systems or monitors under which the program has been tested: Linux, MS Windows XP Programming language used: MAPLE 9.5 (but should be compatible with 9.0 and 8.0, too) Memory and time required to execute with typical data: Storage and time requirements critically depend on the number of qubits, n, in the quantum registers due to the exponential increase of the associated Hilbert space. In particular, complex algebraic operations may require large amounts of memory even for small qubit numbers. However, most of the standard commands (see Section 4 for simple examples) react promptly for up to five qubits on a normal single-processor machine (⩾1 GHz with 512 MB memory) and use less than 10 MB memory. No. of lines in distributed program, including test data, etc.: 8864 No. of bytes in distributed program, including test data, etc.: 493 182 Distribution format: tar.gz Nature of the physical problem: During the last decade, quantum computing has been found to provide a revolutionary new form of computation. The algorithms by Shor [P.W. Shor, SIAM J. Sci. Statist. Comput. 26 (1997) 1484] and Grover [L.K. Grover, Phys. Rev. Lett. 79 (1997) 325. [2

  12. Multiprogramming performance degradation - Case study on a shared memory multiprocessor

    NASA Technical Reports Server (NTRS)

    Dimpsey, R. T.; Iyer, R. K.

    1989-01-01

    The performance degradation due to multiprogramming overhead is quantified for a parallel-processing machine. Measurements of real workloads were taken, and it was found that there is a moderate correlation between the completion time of a program and the amount of system overhead measured during program execution. Experiments in controlled environments were then conducted to calculate a lower bound on the performance degradation of parallel jobs caused by multiprogramming overhead. The results show that the multiprogramming overhead of parallel jobs consumes at least 4 percent of the processor time. When two or more serial jobs are introduced into the system, this amount increases to 5.3 percent.

  13. Parallel computation with the force

    NASA Technical Reports Server (NTRS)

    Jordan, H. F.

    1985-01-01

    A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiprocessor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.
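
    One representative force construct is the self-scheduled parallel loop, in which an unspecified number of processes draw iterations from a shared counter. The C/pthreads sketch below shows the idea under that assumption; the original force is a set of macros generating multiprocessor FORTRAN, not C.

    ```c
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define FORCE 4        /* number of processes in the force */
    #define N 16           /* loop trip count */

    static atomic_int next_iter;
    static double result[N];

    static void *force_member(void *arg)
    {
        (void)arg;
        int i;
        /* Each member repeatedly claims the next unclaimed iteration. */
        while ((i = atomic_fetch_add(&next_iter, 1)) < N)
            result[i] = i * i;         /* stand-in for the real loop body */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[FORCE];
        for (int t = 0; t < FORCE; t++)
            pthread_create(&tid[t], NULL, force_member, NULL);
        for (int t = 0; t < FORCE; t++)    /* join acts as the end-of-loop barrier */
            pthread_join(tid[t], NULL);
        printf("result[%d] = %g\n", N - 1, result[N - 1]);
        return 0;
    }
    ```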

  14. Shared Representations in Language Processing and Verbal Short-Term Memory: The Case of Grammatical Gender

    ERIC Educational Resources Information Center

    Schweppe, Judith; Rummer, Ralf

    2007-01-01

    The general idea of language-based accounts of short-term memory is that retention of linguistic materials is based on representations within the language processing system. In the present sentence recall study, we address the question whether the assumption of shared representations holds for morphosyntactic information (here: grammatical gender…

  15. The Precategorical Nature of Visual Short-Term Memory

    ERIC Educational Resources Information Center

    Quinlan, Philip T.; Cohen, Dale J.

    2016-01-01

    We conducted a series of recognition experiments that assessed whether visual short-term memory (VSTM) is sensitive to shared category membership of to-be-remembered (tbr) images of common objects. In Experiment 1 some of the tbr items shared the same basic level category (e.g., hand axe): Such items were no better retained than others. In the…

  16. Fault tolerant onboard packet switch architecture for communication satellites: Shared memory per beam approach

    NASA Technical Reports Server (NTRS)

    Shalkhauser, Mary Jo; Quintana, Jorge A.; Soni, Nitin J.

    1994-01-01

    The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.

  17. The performance of disk arrays in shared-memory database machines

    NASA Technical Reports Server (NTRS)

    Katz, Randy H.; Hong, Wei

    1993-01-01

    In this paper, we examine how disk arrays and shared memory multiprocessors lead to an effective method for constructing database machines for general-purpose complex query processing. We show that disk arrays can lead to cost-effective storage systems if they are configured from suitably small form-factor disk drives. We introduce the storage system metric data temperature as a way to evaluate how well a disk configuration can sustain its workload, and we show that disk arrays can sustain the same data temperature as a more expensive mirrored-disk configuration. We use the metric to evaluate the performance of disk arrays in XPRS, an operational shared-memory multiprocessor database system being developed at the University of California, Berkeley.

  18. On-Demand Lectures Create an Effective Distributed Education Experience

    ERIC Educational Resources Information Center

    Lindsey, Stanley D.

    2003-01-01

    In this article, the author shares his experience teaching senior-level structural engineering courses at the Georgia Institute of Technology's Georgia Tech Regional Engineering Program. The program is a unique partnership of four universities--Georgia Tech, Savannah State University, Armstrong Atlantic State University, and Georgia Southern…

  19. Internet Technology in Magnetic Resonance: A Common Gateway Interface Program for the World-Wide Web NMR Spectrometer

    PubMed

    Buszko; Buszko; Wang

    1998-04-01

    A custom-written Common Gateway Interface (CGI) program for remote control of an NMR spectrometer using a World Wide Web browser has been described. The program, running on a UNIX workstation, uses multiple processes to handle concurrent tasks of interacting with the user and with the spectrometer. The program's parent process communicates with the browser and sends out commands to the spectrometer; the child process is mainly responsible for data acquisition. Communication between the processes is via the shared memory mechanism. The WWW pages that have been developed for the system make use of the frames feature of web browsers. The CGI program provides an intuitive user interface to the NMR spectrometer, making, in effect, a complex system an easy-to-use Web appliance. Copyright 1998 Academic Press.
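
    The parent/child process structure described above can be sketched in a few lines. This is a minimal C illustration, assuming an anonymous POSIX mmap region rather than whatever shared-memory mechanism the original CGI program used; the data and field names are placeholders.

        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>

        /* Shared region through which an acquisition child reports to its parent. */
        typedef struct {
            volatile int n_points;   /* number of data points acquired so far */
            double fid[1024];        /* placeholder for spectrometer samples */
        } shared_block;

        int main(void) {
            /* Anonymous shared mapping: visible to both parent and forked child. */
            shared_block *blk = mmap(NULL, sizeof *blk, PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (blk == MAP_FAILED) { perror("mmap"); return 1; }
            blk->n_points = 0;

            pid_t pid = fork();
            if (pid == 0) {              /* child: stands in for data acquisition */
                for (int i = 0; i < 16; i++) {
                    blk->fid[i] = (double)i;
                    blk->n_points = i + 1;
                }
                _exit(0);
            }
            waitpid(pid, NULL, 0);       /* a real parent would keep serving CGI requests */
            printf("parent sees %d points, last = %g\n",
                   blk->n_points, blk->fid[blk->n_points - 1]);
            munmap(blk, sizeof *blk);
            return 0;
        }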

  20. Emerald: an object-based language for distributed programming

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hutchinson, N.C.

    1987-01-01

    Distributed systems have become more common; however, constructing distributed applications remains a very difficult task. Numerous operating systems and programming languages have been proposed that attempt to simplify the programming of distributed applications. Here a programming language called Emerald is presented that simplifies distributed programming by extending the concepts of object-based languages to the distributed environment. Emerald supports a single model of computation: the object. Emerald objects include private entities such as integers and Booleans, as well as shared, distributed entities such as compilers, directories, and entire file systems. Emerald objects may move between machines in the system, but object invocation is location independent. The uniform semantic model used for describing all Emerald objects makes the construction of distributed applications in Emerald much simpler than in systems where the differences in implementation between local and remote entities are visible in the language semantics. Emerald incorporates a type system that deals only with the specification of objects - ignoring differences in implementation. Thus, two different implementations of the same abstraction may be freely mixed.

  1. A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

    DOE PAGES

    Song, Fengguang; Dongarra, Jack

    2014-10-01

    Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.

  2. Parallel Implementation of Triangular Cellular Automata for Computing Two-Dimensional Elastodynamic Response on Arbitrary Domains

    NASA Astrophysics Data System (ADS)

    Leamy, Michael J.; Springer, Adam C.

    In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straightforward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.

  3. A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Song, Fengguang; Dongarra, Jack

    Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.

  4. System For Research On Multiple-Arm Robots

    NASA Technical Reports Server (NTRS)

    Backes, Paul G.; Hayati, Samad; Tso, Kam S.; Hayward, Vincent

    1991-01-01

    Kali system of computer programs and equipment provides environment for research on distributed programming and distributed control of coordinated-multiple-arm robots. Suitable for telerobotics research involving sensing and execution of low level tasks. Software and configuration of hardware designed flexible so system modified easily to test various concepts in control and programming of robots, including multiple-arm control, redundant-arm control, shared control, traded control, force control, force/position hybrid control, design and integration of sensors, teleoperation, task-space description and control, methods of adaptive control, control of flexible arms, and human factors.

  5. Optical memories in digital computing

    NASA Technical Reports Server (NTRS)

    Alford, C. O.; Gaylord, T. K.

    1979-01-01

    High-capacity optical memories with relatively high data-transfer rates and multiport simultaneous-access capability may serve as the basis for new computer architectures. Several computer structures that might profitably use such memories are: a) a simultaneous record-access system, b) a simultaneously-shared memory computer system, and c) a parallel digital processing structure.

  6. Reader set encoding for directory of shared cache memory in multiprocessor system

    DOEpatents

    Ahn, Daniel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Zhuang, Xiaotong

    2014-06-10

    In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
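
    A hedged sketch of the bitset encoding mentioned above: one bit per speculative thread id records the readers of a cache line, and a write conflicts when any other thread's reader bit is set. The type and helper names are invented for illustration; the actual directory format in the patent is more elaborate.

        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical reader-set encoding: one bit per speculative thread id. */
        typedef uint64_t reader_set;   /* supports up to 64 thread ids */

        static inline reader_set rs_add(reader_set s, unsigned tid) {
            return s | (1ULL << tid);               /* record tid as a reader */
        }
        static inline int rs_contains(reader_set s, unsigned tid) {
            return (int)((s >> tid) & 1ULL);
        }
        /* A write by `tid` conflicts with any *other* recorded reader. */
        static inline int rs_write_conflicts(reader_set s, unsigned tid) {
            return (s & ~(1ULL << tid)) != 0;
        }

        int main(void) {
            reader_set line_readers = 0;
            line_readers = rs_add(line_readers, 3);   /* thread 3 read the line */
            line_readers = rs_add(line_readers, 7);   /* thread 7 read the line */
            printf("thread 7 recorded: %d\n", rs_contains(line_readers, 7));
            printf("write by thread 3 conflicts: %d\n",
                   rs_write_conflicts(line_readers, 3));
            return 0;
        }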

  7. Semantic graphs and associative memories

    NASA Astrophysics Data System (ADS)

    Pomi, Andrés; Mizraji, Eduardo

    2004-12-01

    Graphs have been increasingly utilized in the characterization of complex networks from diverse origins, including different kinds of semantic networks. Human memories are associative and are known to support complex semantic nets; these nets are represented by graphs. However, it is not known how the brain can sustain these semantic graphs. The vision of cognitive brain activities, shown by modern functional imaging techniques, assigns renewed value to classical distributed associative memory models. Here we show that these neural network models, also known as correlation matrix memories, naturally support a graph representation of the stored semantic structure. We demonstrate that the adjacency matrix of this graph of associations is just the memory coded with the standard basis of the concept vector space, and that the spectrum of the graph is a code invariant of the memory. As long as the assumptions of the model remain valid, this result provides a practical method to predict and modify the evolution of the cognitive dynamics. Also, it could provide us with a way to comprehend how individual brains that map the external reality, almost surely with different particular vector representations, are nevertheless able to communicate and share a common knowledge of the world. We finish by presenting adaptive association graphs, an extension of the model that makes use of the tensor product, which provides a solution to the known problem of branching in semantic nets.
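
    The statement that the adjacency matrix of the association graph is just the memory coded with the standard basis can be checked directly: storing each association e_i -> e_j as the outer product e_j e_i^T yields exactly the adjacency matrix. A minimal numeric sketch with an illustrative edge list:

        #include <stdio.h>

        #define N 4   /* number of concepts */

        /* Correlation-matrix (outer-product) associative memory over the standard
           basis: storing the pair (e_j <- e_i) adds e_j e_i^T, so M accumulates
           the adjacency matrix of the association graph. */
        int main(void) {
            double M[N][N] = {{0}};
            int edges[][2] = { {0,1}, {1,2}, {2,0}, {2,3} };  /* i -> j pairs */
            for (size_t k = 0; k < sizeof edges / sizeof edges[0]; k++) {
                int i = edges[k][0], j = edges[k][1];
                M[j][i] += 1.0;               /* outer product e_j e_i^T */
            }
            /* Recall: M applied to e_i is the superposition of i's successors. */
            int i = 2;
            printf("successors of concept %d:", i);
            for (int j = 0; j < N; j++)
                if (M[j][i] != 0.0) printf(" %d", j);
            printf("\n");
            return 0;
        }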

  8. Insights on consciousness from taste memory research.

    PubMed

    Gallo, Milagros

    2016-01-01

    Taste research in rodents supports the relevance of memory in order to determine the content of consciousness by modifying both taste perception and later action. Associated with this issue is the fact that taste and visual modalities share anatomical circuits traditionally related to conscious memory. This challenges the view of taste memory as a type of non-declarative unconscious memory.

  9. Investigating Ground Swarm Robotics Using Agent Based Simulation

    DTIC Science & Technology

    2006-12-01

    Incorporation of virtual pheromones as a shared memory map is modeled as an additional capability that is found to enhance the robustness and reliability of the swarm…

  10. STEMsalabim: A high-performance computing cluster friendly code for scanning transmission electron microscopy image simulations of thin specimens.

    PubMed

    Oelerich, Jan Oliver; Duschek, Lennart; Belz, Jürgen; Beyer, Andreas; Baranovskii, Sergei D; Volz, Kerstin

    2017-06-01

    We present a new multislice code for the computer simulation of scanning transmission electron microscope (STEM) images based on the frozen lattice approximation. Unlike existing software packages, the code is optimized to perform well on highly parallelized computing clusters, combining distributed and shared memory architectures. This enables efficient calculation of large lateral scanning areas of the specimen within the frozen lattice approximation and fine-grained sweeps of parameter space. Copyright © 2017 Elsevier B.V. All rights reserved.

  11. A Parallel Rendering Algorithm for MIMD Architectures

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.; Orloff, Tobias

    1991-01-01

    Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.

  12. Efficient diagonalization of the sparse matrices produced within the framework of the UK R-matrix molecular codes

    NASA Astrophysics Data System (ADS)

    Galiatsatos, P. G.; Tennyson, J.

    2012-11-01

    The most time-consuming step within the framework of the UK R-matrix molecular codes is the diagonalization of the inner-region Hamiltonian matrix (IRHM). Here we present the method that we follow to speed up this step. We use shared memory machines (SMM), distributed memory machines (DMM), the OpenMP directive-based parallel language, the MPI function-based parallel language, the sparse matrix diagonalizers ARPACK and PARPACK, a variation for real symmetric matrices of the official coordinate sparse matrix format and, finally, a parallel sparse matrix-vector product (PSMV). The efficient application of these techniques relies on two important facts: the sparsity of the matrix is large enough (more than 98%), and only a small part of the matrix spectrum is needed to obtain converged results.
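
    A minimal sketch of the symmetric coordinate-format storage and the matrix-vector product it implies, assuming only the lower triangle is stored; the parallel PSMV of the paper would distribute this loop over threads or processes, which is omitted here.

        #include <stdio.h>

        /* Coordinate-format (COO) sparse matrix-vector product for a real
           symmetric matrix: store the lower triangle and apply each
           off-diagonal entry twice (a_ij and a_ji). */
        void symm_coo_spmv(int nnz, const int *row, const int *col,
                           const double *val, const double *x, double *y, int n) {
            for (int i = 0; i < n; i++) y[i] = 0.0;
            for (int k = 0; k < nnz; k++) {
                int i = row[k], j = col[k];
                y[i] += val[k] * x[j];
                if (i != j) y[j] += val[k] * x[i];   /* symmetric counterpart */
            }
        }

        int main(void) {
            /* Lower triangle of [[2,1,0],[1,3,0],[0,0,4]]. */
            int row[] = {0, 1, 1, 2}, col[] = {0, 0, 1, 2};
            double val[] = {2, 1, 3, 4}, x[] = {1, 1, 1}, y[3];
            symm_coo_spmv(4, row, col, val, x, y, 3);
            printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* expect 3 4 4 */
            return 0;
        }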

  13. A parallel implementation of a multisensor feature-based range-estimation method

    NASA Technical Reports Server (NTRS)

    Suorsa, Raymond E.; Sridhar, Banavar

    1993-01-01

    There are many proposed vision based methods to perform obstacle detection and avoidance for autonomous or semi-autonomous vehicles. All methods, however, will require very high processing rates to achieve real time performance. A system capable of supporting autonomous helicopter navigation will need to extract obstacle information from imagery at rates varying from ten frames per second to thirty or more frames per second depending on the vehicle speed. Such a system will need to sustain billions of operations per second. To reach such high processing rates using current technology, a parallel implementation of the obstacle detection/ranging method is required. This paper describes an efficient and flexible parallel implementation of a multisensor feature-based range-estimation algorithm, targeted for helicopter flight, realized on both a distributed-memory and shared-memory parallel computer.

  14. The role of similarity in updating numerical information in working memory: decomposing the numerical distance effect.

    PubMed

    Lendínez, Cristina; Pelegrina, Santiago; Lechuga, M Teresa

    2014-01-01

    The present study investigates the process of updating representations in working memory (WM) and how similarity between the information involved influences this process. In WM updating tasks, the similarity in terms of numerical distance between the number to be substituted and the new one facilitates the updating process. We aimed to disentangle the possible effect of two dimensions of similarity that may contribute to this numerical effect: numerical distance itself and common digits shared between the numbers involved. Three experiments were conducted in which different ranges of distances and the coincidence between the digits of the two numbers involved in updating were manipulated. Results showed that the two dimensions of similarity had an effect on updating times. The greater the similarity between the information maintained in memory and the new information that substituted it, the faster the updating. This is consistent both with the idea of distributed representations based on features, and with a selective updating process based on a feature overwriting mechanism. Thus, updating in WM can be understood as a selective substitution process influenced by similarity in which only certain parts of the representation stored in memory are changed.

  15. A High Order, Locally-Adaptive Method for the Navier-Stokes Equations

    NASA Astrophysics Data System (ADS)

    Chan, Daniel

    1998-11-01

    I have extended the FOSLS method of Cai, Manteuffel and McCormick (1997) and implemented it within the framework of a spectral element formulation using the Legendre polynomial basis function. The FOSLS method solves the Navier-Stokes equations as a system of coupled first-order equations and provides the ellipticity that is needed for fast iterative matrix solvers like multigrid to operate efficiently. Each element is treated as an object and its properties are self-contained. Only C^0 continuity is imposed across element interfaces; this design allows local grid refinement and coarsening without the burden of having an elaborate data structure, since only information along element boundaries is needed. With the FORTRAN 90 programming environment, I can maintain a high computational efficiency by employing a hybrid parallel processing model. The OpenMP directives provide parallelism at the loop level, which is executed on a shared-memory SMP, and the MPI protocol allows the distribution of elements to a cluster of SMPs connected via a commodity network. This talk will provide timing results and a comparison with a second order finite difference method.
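
    The hybrid model described (MPI across SMP nodes, OpenMP within a node) follows the usual pattern sketched below, here in C rather than the talk's FORTRAN 90; the element counts and per-element work are placeholders.

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int provided, rank, nranks;
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            if (provided < MPI_THREAD_FUNNELED)
                fprintf(stderr, "warning: limited thread support\n");
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nranks);

            int n_elements = 64;                 /* global element count */
            int local = n_elements / nranks;     /* block distribution across nodes */

            double local_sum = 0.0, global_sum;
            #pragma omp parallel for reduction(+:local_sum)
            for (int e = 0; e < local; e++)      /* OpenMP: loop-level parallelism */
                local_sum += 1.0;                /* stand-in for per-element work */

            MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
            if (rank == 0) printf("processed %g elements\n", global_sum);
            MPI_Finalize();
            return 0;
        }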

  16. Exploiting multi-scale parallelism for large scale numerical modelling of laser wakefield accelerators

    NASA Astrophysics Data System (ADS)

    Fonseca, R. A.; Vieira, J.; Fiuza, F.; Davidson, A.; Tsung, F. S.; Mori, W. B.; Silva, L. O.

    2013-12-01

    A new generation of laser wakefield accelerators (LWFA), supported by the extreme accelerating fields generated in the interaction of PW-class lasers and underdense targets, promises the production of high quality electron beams in short distances for multiple applications. Achieving this goal will rely heavily on numerical modelling to further understand the underlying physics and identify optimal regimes, but large scale modelling of these scenarios is computationally heavy and requires the efficient use of state-of-the-art petascale supercomputing systems. We discuss the main difficulties involved in running these simulations and the new developments implemented in the OSIRIS framework to address these issues, ranging from multi-dimensional dynamic load balancing and hybrid distributed/shared memory parallelism to the vectorization of the PIC algorithm. We present the results of the OASCR Joule Metric program on the issue of large scale modelling of LWFA, demonstrating speedups of over 1 order of magnitude on the same hardware. Finally, scalability to over ~10^6 cores and sustained performance over ~2 PFlops is demonstrated, opening the way for large scale modelling of LWFA scenarios.

  17. Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak

    1999-01-01

    The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.

  18. Centrally managed unified shared virtual address space

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wilkes, John

    Systems, apparatuses, and methods for managing a unified shared virtual address space. A host may execute system software and manage a plurality of nodes coupled to the host. The host may send work tasks to the nodes, and for each node, the host may externally manage the node's view of the system's virtual address space. Each node may have a central processing unit (CPU) style memory management unit (MMU) with an internal translation lookaside buffer (TLB). In one embodiment, the host may be coupled to a given node via an input/output memory management unit (IOMMU) interface, where the IOMMU frontend interface shares the TLB with the given node's MMU. In another embodiment, the host may control the given node's view of virtual address space via memory-mapped control registers.

  19. Attention and Visuospatial Working Memory Share the Same Processing Resources

    PubMed Central

    Feng, Jing; Pratt, Jay; Spence, Ian

    2012-01-01

    Attention and visuospatial working memory (VWM) share very similar characteristics; both have the same upper bound of about four items in capacity and they recruit overlapping brain regions. We examined whether both attention and VWM share the same processing resources using a novel dual-task costs approach based on a load-varying dual-task technique. With sufficiently large loads on attention and VWM, considerable interference between the two processes was observed. A further load increase on either process produced reciprocal increases in interference on both processes, indicating that attention and VWM share common resources. More critically, comparison among four experiments on the reciprocal interference effects, as measured by the dual-task costs, demonstrates no significant contribution from additional processing other than the shared processes. These results support the notion that attention and VWM share the same processing resources. PMID:22529826

  20. EM-ANIMATE - COMPUTER PROGRAM FOR DISPLAYING AND ANIMATING THE STEADY-STATE TIME-HARMONIC ELECTROMAGNETIC NEAR FIELD AND SURFACE-CURRENT SOLUTIONS

    NASA Technical Reports Server (NTRS)

    Hom, K. W.

    1994-01-01

    The EM-ANIMATE program is a specialized visualization program that displays and animates the near-field and surface-current solutions obtained from an electromagnetics program, in particular, that from MOM3D (LAR-15074). The EM-ANIMATE program is windows based and contains a user-friendly, graphical interface for setting viewing options, case selection, file manipulation, etc. EM-ANIMATE displays the field and surface-current magnitude as smooth shaded color fields (color contours) ranging from a minimum contour value to a maximum contour value for the fields and surface currents. The program can display either the total electric field or the scattered electric field in either time-harmonic animation mode or in the root mean square (RMS) average mode. The default setting is initially set to the minimum and maximum values within the field and surface current data and can be optionally set by the user. The field and surface-current values are animated by calculating and viewing the solution at user selectable radian time increments between 0 and 2pi. The surface currents can also be displayed in either time-harmonic animation mode or in RMS average mode. In RMS mode, the color contours do not vary with time, but show the constant time averaged field and surface-current magnitude solution. The electric field and surface-current directions can be displayed as scaled vector arrows which have a length proportional to the magnitude at each field grid point or surface node point. These vector properties can be viewed separately or concurrently with the field or surface-current magnitudes. Animation speed is improved by turning off the display of the vector arrows. In RMS modes, the direction vectors are still displayed as varying with time since the time averaged direction vectors would be zero length vectors. Other surface properties can optionally be viewed. These include the surface grid, the resistance value assigned to each element of the grid, and the power dissipation of each element which has an assigned resistance value. The EM-ANIMATE program will accept up to 10 different surface current cases each consisting of up to 20,000 node points and 10,000 triangle definitions and will animate one of these cases. The capability is used to compare surface-current distribution due to various initial excitation directions or electric field orientations. The program can accept up to 50 planes of field data consisting of a grid of 100 by 100 field points. These planes of data are user selectable and can be viewed individually or concurrently. With these preset limits, the program requires 55 megabytes of core memory to run. These limits can be changed in the header files to accommodate the available core memory of an individual workstation. An estimate of memory required can be made as follows: approximate memory in bytes equals (number of nodes times number of surfaces times 14 variables times bytes per word, typically 4 bytes per floating point) plus (number of field planes times number of nodes per plane times 21 variables times bytes per word). This gives the approximate memory size required to store the field and surface-current data. The total memory size is approximately 400,000 bytes plus the data memory size. The animation calculations are performed in real time at any user set time step. For Silicon Graphics Workstations that have multiple processors, this program has been optimized to perform these calculations on multiple processors to increase animation rates.
The optimized program uses the SGI PFA (Power FORTRAN Accelerator) library. On single processor machines, the parallelization directives are seen as comments to the program and will have no effect on compilation or execution. EM-ANIMATE is written in FORTRAN 77 for implementation on SGI IRIS workstations running IRIX 3.0 or later. A minimum of 55Mb of RAM is required for execution of this program; however, the code may be modified to accommodate the available memory of an individual workstation. For program execution, twenty-four bit, double-buffered color capability is suggested, but not required. Sample input and output files and a sample executable are provided on the distribution medium. Electronic documentation is provided in PostScript format and in the form of IRIX man pages. The standard distribution medium for EM-ANIMATE is a .25 inch streaming magnetic IRIX tape cartridge in UNIX tar format. EM-ANIMATE is also available as part of a bundled package, COS-10048 that includes MOM3D, an IRIS program that produces electromagnetic near field and surface current solutions. This program was developed in 1993.
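
    The memory estimate quoted above can be coded directly. With the program's preset limits (10 cases of 20,000 nodes, 50 planes of 100 by 100 field points, 4-byte words), the formula lands close to the stated 55 MB requirement. A small C sketch of the arithmetic:

        #include <stdio.h>

        /* Memory estimate from the EM-ANIMATE documentation above:
           data bytes ~ nodes*surfaces*14*bytes_per_word
                      + planes*nodes_per_plane*21*bytes_per_word,
           total bytes ~ 400000 + data. */
        long em_animate_bytes(long nodes, long surfaces,
                              long planes, long nodes_per_plane,
                              long bytes_per_word) {
            long data = nodes * surfaces * 14 * bytes_per_word
                      + planes * nodes_per_plane * 21 * bytes_per_word;
            return 400000L + data;
        }

        int main(void) {
            /* Preset limits: 10 cases of 20,000 nodes, 50 planes of 100x100
               field points, 4 bytes per floating-point word. */
            long b = em_animate_bytes(20000, 10, 50, 100L * 100, 4);
            printf("estimated memory: %ld bytes (~%.0f MB)\n", b, b / 1.0e6);
            return 0;
        }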

  1. User-Defined Data Distributions in High-Level Programming Languages

    NASA Technical Reports Server (NTRS)

    Diaconescu, Roxana E.; Zima, Hans P.

    2006-01-01

    One of the characteristic features of today's high performance computing systems is a physically distributed memory. Efficient management of locality is essential for meeting key performance requirements for these architectures. The standard technique for dealing with this issue has involved the extension of traditional sequential programming languages with explicit message passing, in the context of a processor-centric view of parallel computation. This has resulted in complex and error-prone assembly-style codes in which algorithms and communication are inextricably interwoven. This paper presents a high-level approach to the design and implementation of data distributions. Our work is motivated by the need to improve the current parallel programming methodology by introducing a paradigm supporting the development of efficient and reusable parallel code. This approach is currently being implemented in the context of a new programming language called Chapel, which is designed in the HPCS project Cascade.
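
    What such a high-level distribution hides from the programmer is bookkeeping of the following kind: mapping a global index to an owning processor and a local offset. A minimal 1-D block-distribution sketch in C; the names are illustrative and are not Chapel's actual interface.

        #include <stdio.h>

        /* Map a global index to (owner processor, local index) under a 1-D
           block distribution of n elements over nprocs processors. */
        typedef struct { int owner; int local; } location;

        location block_map(int global, int n, int nprocs) {
            int chunk = (n + nprocs - 1) / nprocs;   /* ceiling division */
            location loc = { global / chunk, global % chunk };
            return loc;
        }

        int main(void) {
            int n = 10, nprocs = 3;
            for (int g = 0; g < n; g++) {
                location loc = block_map(g, n, nprocs);
                printf("global %d -> proc %d, local %d\n", g, loc.owner, loc.local);
            }
            return 0;
        }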

  2. BSR: B-spline atomic R-matrix codes

    NASA Astrophysics Data System (ADS)

    Zatsarinny, Oleg

    2006-02-01

    BSR is a general program to calculate atomic continuum processes using the B-spline R-matrix method, including electron-atom and electron-ion scattering, and radiative processes such as bound-bound transitions, photoionization and polarizabilities. The calculations can be performed in LS-coupling or in an intermediate-coupling scheme by including terms of the Breit-Pauli Hamiltonian. New version program summary Title of program: BSR Catalogue identifier: ADWY Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADWY Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland Computers on which the program has been tested: Microway Beowulf cluster; Compaq Beowulf cluster; DEC Alpha workstation; DELL PC Operating systems under which the new version has been tested: UNIX, Windows XP Programming language used: FORTRAN 95 Memory required to execute with typical data: Typically 256-512 Mwords. Since all the principal dimensions are allocatable, the available memory defines the maximum complexity of the problem No. of bits in a word: 8 No. of processors used: 1 Has the code been vectorized or parallelized?: no No. of lines in distributed program, including test data, etc.: 69 943 No. of bytes in distributed program, including test data, etc.: 746 450 Peripherals used: scratch disk store; permanent disk store Distribution format: tar.gz Nature of physical problem: This program uses the R-matrix method to calculate electron-atom and electron-ion collision processes, with options to calculate radiative data, photoionization, etc. The calculations can be performed in LS-coupling or in an intermediate-coupling scheme, with options to include Breit-Pauli terms in the Hamiltonian. Method of solution: The R-matrix method is used [P.G. Burke, K.A. Berrington, Atomic and Molecular Processes: An R-Matrix Approach, IOP Publishing, Bristol, 1993; P.G. Burke, W.D. Robb, Adv. At. Mol. Phys. 11 (1975) 143; K.A. Berrington, W.B. Eissner, P.H. Norrington, Comput. Phys. Comm. 92 (1995) 290].

  3. Preconditioned implicit solvers for the Navier-Stokes equations on distributed-memory machines

    NASA Technical Reports Server (NTRS)

    Ajmani, Kumud; Liou, Meng-Sing; Dyson, Rodger W.

    1994-01-01

    The GMRES method is parallelized, and combined with local preconditioning to construct an implicit parallel solver to obtain steady-state solutions for the Navier-Stokes equations of fluid flow on distributed-memory machines. The new implicit parallel solver is designed to preserve the convergence rate of the equivalent 'serial' solver. A static domain-decomposition is used to partition the computational domain amongst the available processing nodes of the parallel machine. The SPMD (Single-Program Multiple-Data) programming model is combined with message-passing tools to develop the parallel code on a 32-node Intel Hypercube and a 512-node Intel Delta machine. The implicit parallel solver is validated for internal and external flow problems, and is found to compare identically with flow solutions obtained on a Cray Y-MP/8. A peak computational speed of 2300 MFlops/sec has been achieved on 512 nodes of the Intel Delta machine, for a problem size of 1024 K equations (256 K grid points).

  4. Does a House Divided Stand? Kinship and the Continuity of Shared Living Arrangements

    PubMed Central

    Glick, Jennifer E.; Van Hook, Jennifer

    2011-01-01

    Shared living arrangements can provide housing, economies of scale, and other instrumental support and may become an important resource in times of economic constraint. But the extent to which such living arrangements experience continuity or rapid change in composition is unclear. Previous research on extended-family households tended to focus on factors that trigger the onset of coresidence, including life course events or changes in health status and related economic needs. Relying on longitudinal data from 9,932 households in the Survey of Income and Program Participation (SIPP), the analyses demonstrate that the distribution of economic resources in the household also influences the continuity of shared living arrangements. The results suggest that multigenerational households of parents and adult children experience greater continuity in composition when one individual or couple has a disproportionate share of the economic resources in the household. Other coresidential households, those shared by other kin or nonkin, experience greater continuity when resources are more evenly distributed. PMID:22259218

  5. System and method for memory allocation in a multiclass memory system

    DOEpatents

    Loh, Gabriel; Meswani, Mitesh; Ignatowski, Michael; Nutter, Mark

    2016-06-28

    A system for memory allocation in a multiclass memory system includes a processor coupleable to a plurality of memories sharing a unified memory address space, and a library store to store a library of software functions. The processor identifies a type of a data structure in response to a memory allocation function call to the library for allocating memory to the data structure. Using the library, the processor allocates portions of the data structure among multiple memories of the multiclass memory system based on the type of the data structure.
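
    A hedged sketch of the type-directed allocation idea: the library call receives the data-structure type and routes the request to a memory class. The classes, routing policy, and function names below are invented for illustration and are not the patent's interface.

        #include <stdio.h>
        #include <stdlib.h>

        /* Hypothetical data-structure types and memory classes. */
        typedef enum { DS_HOT_ARRAY, DS_COLD_LOG, DS_DEFAULT } ds_type;
        typedef enum { MEM_FAST, MEM_LARGE } mem_class;

        static mem_class classify(ds_type t) {
            switch (t) {
            case DS_HOT_ARRAY: return MEM_FAST;   /* bandwidth-bound: in-package */
            case DS_COLD_LOG:  return MEM_LARGE;  /* capacity-bound: off-package */
            default:           return MEM_LARGE;
            }
        }

        void *multiclass_alloc(size_t bytes, ds_type t) {
            mem_class c = classify(t);
            /* A real system would call class-specific allocators; this sketch
               only tags the decision and falls back to malloc. */
            printf("allocating %zu bytes in %s memory\n", bytes,
                   c == MEM_FAST ? "fast" : "large");
            return malloc(bytes);
        }

        int main(void) {
            void *hot  = multiclass_alloc(1 << 20, DS_HOT_ARRAY);
            void *cold = multiclass_alloc(1 << 26, DS_COLD_LOG);
            free(hot); free(cold);
            return 0;
        }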

  6. Efficient implementation of multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

    DOEpatents

    Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

    2012-01-10

    The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
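
    The central data movement described here, re-distributing row-FFT results so the second dimension becomes node-local, maps naturally onto an all-to-all exchange. A minimal MPI sketch follows; the FFTs themselves and the patent's randomized ordering are omitted, and the buffer sizes assume at most 16 ranks.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, p;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &p);

            int block = 4;                 /* elements exchanged per node pair */
            double send[64], recv[64];     /* assumes p * block <= 64 */
            for (int i = 0; i < p * block; i++)
                send[i] = rank * 100 + i;  /* stand-in for row-FFT results */

            /* ...first 1-D FFTs on `send` would happen here... */
            MPI_Alltoall(send, block, MPI_DOUBLE, recv, block, MPI_DOUBLE,
                         MPI_COMM_WORLD);
            /* ...second 1-D FFTs on `recv` would happen here... */

            printf("rank %d received first element %g from rank 0\n",
                   rank, recv[0]);
            MPI_Finalize();
            return 0;
        }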

  7. Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

    DOEpatents

    Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

    2008-01-01

    The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.

  8. A Formal Model of Capacity Limits in Working Memory

    ERIC Educational Resources Information Center

    Oberauer, Klaus; Kliegl, Reinhold

    2006-01-01

    A mathematical model of working-memory capacity limits is proposed on the key assumption of mutual interference between items in working memory. Interference is assumed to arise from overwriting of features shared by these items. The model was fit to time-accuracy data of memory-updating tasks from four experiments using nonlinear mixed effect…

  9. Ordering of guarded and unguarded stores for no-sync I/O

    DOEpatents

    Gara, Alan; Ohmacht, Martin

    2013-06-25

    A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. A third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushes only the first queue.

  10. LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kurzak, Jakub; Luszczek, Piotr; Faverge, Mathieu

    2012-03-01

    LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny-Cours CPUs and four NVIDIA Fermi GPUs.
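
    For reference, the underlying numerical procedure, LU factorization with partial pivoting, in its textbook sequential form; the hybrid CPU/GPU scheduling that is the article's contribution is not represented here.

        #include <math.h>
        #include <stdio.h>

        #define N 3

        /* In-place LU factorization with partial pivoting; the pivot order is
           recorded in perm. Returns -1 if the matrix is singular. */
        int lu_factor(double a[N][N], int perm[N]) {
            for (int i = 0; i < N; i++) perm[i] = i;
            for (int k = 0; k < N; k++) {
                int piv = k;                               /* largest pivot in column k */
                for (int i = k + 1; i < N; i++)
                    if (fabs(a[i][k]) > fabs(a[piv][k])) piv = i;
                if (a[piv][k] == 0.0) return -1;
                if (piv != k) {                            /* swap rows k and piv */
                    for (int j = 0; j < N; j++) {
                        double t = a[k][j]; a[k][j] = a[piv][j]; a[piv][j] = t;
                    }
                    int t = perm[k]; perm[k] = perm[piv]; perm[piv] = t;
                }
                for (int i = k + 1; i < N; i++) {          /* eliminate below pivot */
                    a[i][k] /= a[k][k];
                    for (int j = k + 1; j < N; j++)
                        a[i][j] -= a[i][k] * a[k][j];
                }
            }
            return 0;
        }

        int main(void) {
            double a[N][N] = {{2, 1, 1}, {4, 3, 3}, {8, 7, 9}};
            int perm[N];
            if (lu_factor(a, perm) == 0)
                printf("pivot order: %d %d %d\n", perm[0], perm[1], perm[2]);
            return 0;
        }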

  11. Neural Mechanisms of Interference Control Underlie the Relationship between Fluid Intelligence and Working Memory Span

    ERIC Educational Resources Information Center

    Burgess, Gregory C.; Gray, Jeremy R.; Conway, Andrew R. A.; Braver, Todd S.

    2011-01-01

    Fluid intelligence (gF) and working memory (WM) span predict success in demanding cognitive situations. Recent studies show that much of the variance in gF and WM span is shared, suggesting common neural mechanisms. This study provides a direct investigation of the degree to which shared variance in gF and WM span can be explained by neural…

  12. International Review of Standards and Labeling Programs for Distribution Transformers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Letschert, Virginie; Scholand, Michael; Carreño, Ana María

    Transmission and distribution (T&D) losses in electricity networks represent 8.5% of final energy consumption in the world. In Latin America, T&D losses range between 6% and 20% of final energy consumption, and represent 7% in Chile. Because approximately one-third of T&D losses take place in distribution transformers alone, there is significant potential to save energy and reduce costs and carbon emissions through policy intervention to increase distribution transformer efficiency. A large number of economies around the world have recognized the significant impact of addressing distribution losses and have implemented policies to support market transformation towards more efficient distribution transformers. As a result, there is considerable international experience to be shared and leveraged to inform countries interested in reducing distribution losses through policy intervention. The report builds upon past international studies of standards and labeling (S&L) programs for distribution transformers to present the current energy efficiency programs for distribution transformers around the world.

  13. Sawja: Static Analysis Workshop for Java

    NASA Astrophysics Data System (ADS)

    Hubert, Laurent; Barré, Nicolas; Besson, Frédéric; Demange, Delphine; Jensen, Thomas; Monfort, Vincent; Pichardie, David; Turpin, Tiphaine

    Static analysis is a powerful technique for automatic verification of programs but raises major engineering challenges when developing a full-fledged analyzer for a realistic language such as Java. Efficiency and precision of such a tool rely partly on low level components which only depend on the syntactic structure of the language and therefore should not be redesigned for each implementation of a new static analysis. This paper describes the Sawja library: a static analysis workshop fully compliant with Java 6 which provides OCaml modules for efficiently manipulating Java bytecode programs. We present the main features of the library, including i) efficient functional data-structures for representing a program with implicit sharing and lazy parsing, ii) an intermediate stack-less representation, and iii) fast computation and manipulation of complete programs. We provide experimental evaluations of the different features with respect to time, memory and precision.

  14. 7 CFR 1486.209 - How are program applications evaluated and approved?

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... maintain U.S. market share; (ii) Marketing and distribution of value-added products, including new products... channels in emerging markets, including infrastructural impediments to U.S. exports; such studies may...

  15. A Comparison of Three Programming Models for Adaptive Applications

    NASA Technical Reports Server (NTRS)

    Shan, Hong-Zhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak; Kwak, Dochan (Technical Monitor)

    2000-01-01

    We study the performance and programming effort for two major classes of adaptive applications under three leading parallel programming models. We find that all three models can achieve scalable performance on state-of-the-art multiprocessor machines. The basic parallel algorithms needed for different programming models to deliver their best performance are similar, but the implementations differ greatly, far beyond the fact of using explicit messages versus implicit loads/stores. Compared with MPI and SHMEM, CC-SAS (cache-coherent shared address space) provides substantial ease of programming at the conceptual and program orchestration level, which often leads to performance gains. However, it may also suffer from the poor spatial locality of physically distributed shared data on large numbers of processors. Our CC-SAS implementation of the PARMETIS partitioner itself runs faster than in the other two programming models, and generates a more balanced result for our application.

  16. A Direct Experience in a New Accountable Care Organization: Results, Challenges, and the Role of the Neurosurgeon.

    PubMed

    Kim, Dong H; Lloyd, Christopher; Fernandez, Douglas K; Spielman, Amanda; Bradshaw, David

    2017-04-01

    The passage of the Affordable Care Act saw the creation of Accountable Care Organizations (ACOs), a new approach to healthcare delivery moving from fee-for-service toward population health. This paper presents a case study of the Memorial Hermann ACO (MHACO), launched in response to the Medicare Shared Savings Program, with goals to align physician and hospital incentives, practice evidence-based medicine, develop care coordination, and increase efficiency. Building blocks included an affiliated primary care network, a clinical integration program (involving shared electronic medical record platforms and quality data reporting), and significant investments in information technology. Presented is the approach taken to form MHACO; the management structure, technology developed, and a 2-year experience. Incorporated in July 2012, the MHACO involved 22 000 Medicare patients. In 2015, the Centers for Medicare and Medicaid Services released data showing a composite quality score between 80 and 85 (from a maximum 100) and nearly $53 million in total savings (or 11% of expected expenditure), making MHACO one of the most successful nationally [1]. In fewer than 5 years, almost 500 ACOs have developed, and by some estimates, a quarter of Medicare patients are currently enrolled in an ACO. Although ACOs to date have focused on primary care, the future will increasingly involve specialists. At Memorial Hermann, neurosurgeons took an early role in forming collaborative partnerships with the hospital, and started programs that served as precursors to the ACO model. This paper ends with an overview of ACO development, likely changes going forward, and a discussion of the role of specialists in general, and of neurosurgeons in particular. Copyright © 2016 by the Congress of Neurological Surgeons.

  17. Checkpointing Shared Memory Programs at the Application-level

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bronevetsky, G; Schulz, M; Szwed, P

    2004-09-08

    Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. The system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
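
    A minimal sketch of coordinated application-level checkpointing in OpenMP: all threads synchronize, one thread saves the state, and all continue. The paper's compiler-inserted protocol also handles locks and in-flight execution state; none of that is shown here, and the file layout is invented.

        #include <omp.h>
        #include <stdio.h>

        #define N 1000000

        static double state[N];   /* the application state to be saved */

        static void checkpoint(long step) {
            FILE *f = fopen("ckpt.bin", "wb");
            if (!f) return;                       /* sketch: ignore I/O errors */
            fwrite(&step, sizeof step, 1, f);
            fwrite(state, sizeof state[0], N, f);
            fclose(f);
        }

        int main(void) {
            #pragma omp parallel
            for (long step = 0; step < 10; step++) {
                #pragma omp for
                for (long i = 0; i < N; i++)
                    state[i] += 1.0;              /* stand-in for real work */

                if (step % 5 == 4) {              /* periodic coordinated save */
                    #pragma omp barrier
                    #pragma omp single            /* one thread writes; implicit
                                                     barrier follows the single */
                    checkpoint(step);
                }
            }
            printf("done; restart state is in ckpt.bin\n");
            return 0;
        }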

  18. Teacher and Principal Diversity and the Representation of Students of Color in Gifted Programs: Evidence from National Data

    ERIC Educational Resources Information Center

    Grissom, Jason A.; Rodriguez, Luis A.; Kern, Emily C.

    2017-01-01

    Students of color are significantly underrepresented in gifted programs relative to their White peers. Drawing on political science research suggesting that public organizations more equitably distribute policy outputs when service providers share characteristics with their client populations, we investigate whether representation of students of…

  19. Audience-tuning effects on memory: the role of shared reality.

    PubMed

    Echterhoff, Gerald; Higgins, E Tory; Groll, Stephan

    2005-09-01

    After tuning to an audience, communicators' own memories for the topic often reflect the biased view expressed in their messages. Three studies examined explanations for this bias. Memories for a target person were biased when feedback signaled the audience's successful identification of the target but not after failed identification (Experiment 1). Whereas communicators tuning to an in-group audience exhibited the bias, communicators tuning to an out-group audience did not (Experiment 2). These differences did not depend on communicators' mood but were mediated by communicators' trust in their audience's judgment about other people (Experiments 2 and 3). Message and memory were more closely associated for high than for low trusters. Apparently, audience-tuning effects depend on the communicators' experience of a shared reality.

  20. Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

    NASA Astrophysics Data System (ADS)

    Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

    2017-11-01

    Current trends in processor manufacturing focus on multi-core architectures rather than increasing the clock speed for performance improvement. Graphics processors have become commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social-networking web applications, and big data have created huge demand for data-processing activities, and such throughput-intensive applications inherently contain data-level parallelism, which is well suited to SIMD-architecture-based GPUs. This paper reviews the architectural aspects of multi/many-core processors and graphics processors. Different case studies are taken to compare the performance of throughput computing applications using shared-memory programming in OpenMP and CUDA API-based programming.
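
    The shared-memory OpenMP style referred to above amounts to data-parallel loops of the following kind; this is a minimal, self-contained C example with an arbitrary element-wise kernel standing in for real throughput work.

        #include <omp.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void) {
            int n = 10000000;
            float *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
            if (!a || !b) return 1;
            for (int i = 0; i < n; i++) a[i] = (float)i;

            double t0 = omp_get_wtime();
            /* Data-level parallelism: independent element-wise operations. */
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < n; i++)
                b[i] = 2.0f * a[i] + 1.0f;
            double t1 = omp_get_wtime();

            printf("%d elements in %.3f s with up to %d threads\n",
                   n, t1 - t0, omp_get_max_threads());
            free(a); free(b);
            return 0;
        }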

  1. Examining age-related shared variance between face cognition, vision, and self-reported physical health: a test of the common cause hypothesis for social cognition

    PubMed Central

    Olderbak, Sally; Hildebrandt, Andrea; Wilhelm, Oliver

    2015-01-01

    The shared decline in cognitive abilities, sensory functions (e.g., vision and hearing), and physical health with increasing age is well documented with some research attributing this shared age-related decline to a single common cause (e.g., aging brain). We evaluate the extent to which the common cause hypothesis predicts associations between vision and physical health with social cognition abilities specifically face perception and face memory. Based on a sample of 443 adults (17–88 years old), we test a series of structural equation models, including Multiple Indicator Multiple Cause (MIMIC) models, and estimate the extent to which vision and self-reported physical health are related to face perception and face memory through a common factor, before and after controlling for their fluid cognitive component and the linear effects of age. Results suggest significant shared variance amongst these constructs, with a common factor explaining some, but not all, of the shared age-related variance. Also, we found that the relations of face perception, but not face memory, with vision and physical health could be completely explained by fluid cognition. Overall, results suggest that a single common cause explains most, but not all age-related shared variance with domain specific aging mechanisms evident. PMID:26321998

  2. Solutions and debugging for data consistency in multiprocessors with noncoherent caches

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bernstein, D.; Mendelson, B.; Breternitz, M. Jr.

    1995-02-01

    We analyze two important problems that arise in shared-memory multiprocessor systems. The stale data problem involves ensuring that data items in local memory of individual processors are current, independent of writes done by other processors. False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on software methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for the IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.
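
    The false-sharing pattern described above can be reproduced in a few lines. In the sketch below (illustrative only; the 64-byte cache-line size is an assumption, not a property of the POWER/4 system discussed), two threads update adjacent fields that share a cache block, while a padded layout keeps each counter on its own line and avoids the problem.

      /* False-sharing demonstration with a padded remedy (illustrative sketch). */
      #include <pthread.h>
      #include <stdio.h>

      struct counters {
          long a;                        /* updated by thread 0 */
          long b;                        /* updated by thread 1; same cache block as a */
      };

      struct padded {
          long v;
          char pad[64 - sizeof(long)];   /* assumed 64-byte cache line */
      };

      static struct counters shared;
      static struct padded fixed[2];

      static void *bump(void *arg)
      {
          int id = *(int *)arg;
          for (long i = 0; i < 10000000; i++) {
              if (id == 0) shared.a++; else shared.b++;   /* false sharing */
              fixed[id].v++;                              /* no false sharing */
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t t[2];
          int ids[2] = { 0, 1 };
          for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump, &ids[i]);
          for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
          printf("%ld %ld %ld %ld\n", shared.a, shared.b, fixed[0].v, fixed[1].v);
          return 0;
      }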

  3. QRAP: A numerical code for projected (Q)uasiparticle (RA)ndom (P)hase approximation

    NASA Astrophysics Data System (ADS)

    Samana, A. R.; Krmpotić, F.; Bertulani, C. A.

    2010-06-01

    A computer code for the quasiparticle random phase approximation (QRPA) and projected quasiparticle random phase approximation (PQRPA) models of nuclear structure is explained in detail. The residual interaction is approximated by a simple δ-force. An important application of the code consists in evaluating nuclear matrix elements involved in neutrino-nucleus reactions. As an example, cross sections for 56Fe and 12C are calculated and the code output is explained. The application to other nuclei and the description of other nuclear and weak-decay processes are also discussed. Program summary: Title of program: QRAP (Quasiparticle RAndom Phase approximation). Computers: the code has been created on a PC, but also runs on UNIX or LINUX machines. Operating systems: WINDOWS or UNIX. Program language used: Fortran-77. Memory required to execute with typical data: 16 MB of RAM and 2 MB of hard disk space. No. of lines in distributed program, including test data, etc.: ~8000. No. of bytes in distributed program, including test data, etc.: ~256 kB. Distribution format: tar.gz. Nature of physical problem: the program calculates neutrino- and antineutrino-nucleus cross sections as a function of the incident neutrino energy, and muon capture rates, using the QRPA or PQRPA as nuclear structure models. Method of solution: the QRPA, or PQRPA, equations are solved in a self-consistent way for even-even nuclei. The nuclear matrix elements for the neutrino-nucleus interaction are treated as the inverse beta reaction of odd-odd nuclei as a function of the transferred momentum. Typical running time: ≈5 min on a 3 GHz processor for data set 1.

  4. Static Memory Deduplication for Performance Optimization in Cloud Computing.

    PubMed

    Jia, Gangyong; Han, Guangjie; Wang, Hao; Yang, Xuan

    2017-04-27

    In a cloud computing environment, the number of virtual machines (VMs) on a single physical server and the number of applications running on each VM are continuously growing. This has led to an enormous increase in the demand for memory capacity and a subsequent increase in energy consumption in the cloud. Insufficient memory has become a major bottleneck for the scalability and performance of virtualization interfaces in cloud computing. To address this problem, memory deduplication techniques, which reduce memory demand through page sharing, are being adopted. However, such techniques suffer from overheads in terms of the number of online comparisons required for the deduplication. In this paper, we propose a static memory deduplication (SMD) technique that reduces the memory capacity requirement and provides performance optimization in cloud computing. The main innovation of SMD is that the process of page detection is performed offline, thus potentially reducing the performance cost, especially in terms of response time. In SMD, page comparisons are restricted to the code segment, which has the highest shared content. Our experimental results show that SMD efficiently reduces the memory capacity requirement and improves performance. We demonstrate that, compared to other approaches, the cost in terms of response time is negligible.
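
    The offline page-comparison idea can be sketched in a few lines. The program below is hypothetical (the page size, hash function, and simulated code segment are invented for illustration; the paper does not describe SMD at this level of detail): it hashes fixed-size pages and reports pages with equal hashes as sharing candidates.

      /* Hypothetical sketch of offline page deduplication over a code segment. */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define PAGE 4096
      #define NPAGES 8

      /* FNV-1a hash, chosen arbitrarily for the sketch. */
      static uint64_t fnv1a(const unsigned char *p, size_t n)
      {
          uint64_t h = 1469598103934665603ULL;
          while (n--) { h ^= *p++; h *= 1099511628211ULL; }
          return h;
      }

      int main(void)
      {
          static unsigned char seg[NPAGES * PAGE];   /* stand-in code segment */
          memset(seg, 0x90, sizeof seg);             /* all pages identical... */
          seg[3 * PAGE] = 0xCC;                      /* ...except page 3 */

          uint64_t h[NPAGES];
          for (int i = 0; i < NPAGES; i++)
              h[i] = fnv1a(seg + (size_t)i * PAGE, PAGE);

          /* An O(n^2) scan is fine for a sketch; a real system would bucket by
             hash and byte-compare candidates to rule out collisions. */
          for (int i = 0; i < NPAGES; i++)
              for (int j = i + 1; j < NPAGES; j++)
                  if (h[i] == h[j])
                      printf("pages %d and %d are dedup candidates\n", i, j);
          return 0;
      }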

  5. Static Memory Deduplication for Performance Optimization in Cloud Computing

    PubMed Central

    Jia, Gangyong; Han, Guangjie; Wang, Hao; Yang, Xuan

    2017-01-01

    In a cloud computing environment, the number of virtual machines (VMs) on a single physical server and the number of applications running on each VM are continuously growing. This has led to an enormous increase in the demand for memory capacity and a subsequent increase in energy consumption in the cloud. Insufficient memory has become a major bottleneck for the scalability and performance of virtualization interfaces in cloud computing. To address this problem, memory deduplication techniques, which reduce memory demand through page sharing, are being adopted. However, such techniques suffer from overheads in terms of the number of online comparisons required for the deduplication. In this paper, we propose a static memory deduplication (SMD) technique that reduces the memory capacity requirement and provides performance optimization in cloud computing. The main innovation of SMD is that the process of page detection is performed offline, thus potentially reducing the performance cost, especially in terms of response time. In SMD, page comparisons are restricted to the code segment, which has the highest shared content. Our experimental results show that SMD efficiently reduces the memory capacity requirement and improves performance. We demonstrate that, compared to other approaches, the cost in terms of response time is negligible. PMID:28448434

  6. Simulation of n-qubit quantum systems. III. Quantum operations

    NASA Astrophysics Data System (ADS)

    Radtke, T.; Fritzsche, S.

    2007-05-01

    During the last decade, several quantum information protocols, such as quantum key distribution, teleportation or quantum computation, have attracted a lot of interest. Despite the recent success and research efforts in quantum information processing, however, we are just at the beginning of understanding the role of entanglement and the behavior of quantum systems in noisy environments, i.e. for nonideal implementations. Therefore, in order to facilitate the investigation of entanglement and decoherence in n-qubit quantum registers, here we present a revised version of the FEYNMAN program for working with quantum operations and their associated (Jamiołkowski) dual states. Based on the implementation of several popular decoherence models, we provide tools especially for the quantitative analysis of quantum operations. Apart from the implementation of different noise models, the current program extension may help investigate the fragility of many quantum states, one of the main obstacles in realizing quantum information protocols today. Program summary: Title of program: Feynman. Catalogue identifier: ADWE_v3_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADWE_v3_0. Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland. Licensing provisions: None. Operating systems: any system that supports MAPLE; tested under Microsoft Windows XP, SuSe Linux 10. Program language used: MAPLE 10. Typical time and memory requirements: most commands that act upon quantum registers with five or fewer qubits take ⩽10 seconds of processor time (on a Pentium 4 processor with ⩾2 GHz or equivalent) and 5-20 MB of memory. Especially when working with symbolic expressions, however, the memory and time requirements critically depend on the number of qubits in the quantum registers, owing to the exponential dimension growth of the associated Hilbert space. For example, complex (symbolic) noise models (with several Kraus operators) for multi-qubit systems often result in very large symbolic expressions that dramatically slow down the evaluation of measures or other quantities. In these cases, MAPLE's assume facility sometimes helps to reduce the complexity of symbolic expressions, but often only numerical evaluation is possible. Since the complexity of the FEYNMAN commands varies widely, no general scaling law for the CPU time and memory usage can be given. No. of bytes in distributed program including test data, etc.: 799 265. No. of lines in distributed program including test data, etc.: 18 589. Distribution format: tar.gz. Reasons for new version: while the previous program versions were designed mainly to create and manipulate the state of quantum registers, the present extension aims to support quantum operations as the essential ingredient for studying the effects of noisy environments. Does this version supersede the previous version: Yes. Nature of the physical problem: today, entanglement is identified as the essential resource in virtually all aspects of quantum information theory. In most practical implementations of quantum information protocols, however, decoherence typically limits the lifetime of entanglement. It is therefore necessary and highly desirable to understand the evolution of entanglement in noisy environments.
Method of solution: Using the computer algebra system MAPLE, we have developed a set of procedures that support the definition and manipulation of n-qubit quantum registers as well as (unitary) logic gates and (nonunitary) quantum operations that act on the quantum registers. The provided hierarchy of commands can be used interactively in order to simulate and analyze the evolution of n-qubit quantum systems in ideal and nonideal quantum circuits.

  7. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment.

    PubMed

    Newberg, Lee A

    2008-08-15

    A backtrace through a dynamic programming algorithm's intermediate results, whether in search of an optimal path, to sample paths according to an implied probability distribution, or as the second stage of a forward-backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g., cache), existing approaches store selected stages of the computation and recompute missing values from these checkpoints on an as-needed basis. Here we present an optimal checkpointing strategy and demonstrate its utility with pairwise local sequence alignment of sequences of length 10,000. Sample C++ code for the optimal backtrace is available in the Supplementary Materials. Supplementary data are available at Bioinformatics online.
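
    The checkpoint-and-recompute strategy is easy to sketch. The toy program below applies it to a simple grid dynamic program (minimum-cost monotone path): the forward pass keeps every K-th row, and the backtrace rebuilds each stripe of rows from the checkpoint beneath it. It illustrates the generic technique only, not the optimal checkpoint placement derived in the paper.

      /* Toy checkpointed backtrace for a grid DP (illustrative sketch). */
      #include <stdio.h>
      #include <string.h>

      #define M 9
      #define N 8
      #define K 3

      static int cost[M][N];
      static int ckpt[M / K + 1][N];

      /* Compute DP row i from row i-1 (prev); prev is ignored when i == 0. */
      static void fill_row(int i, const int *prev, int *out)
      {
          for (int j = 0; j < N; j++) {
              if (i == 0 && j == 0)   out[j] = cost[i][j];
              else if (i == 0)        out[j] = out[j - 1] + cost[i][j];
              else if (j == 0)        out[j] = prev[j] + cost[i][j];
              else out[j] = (prev[j] < out[j - 1] ? prev[j] : out[j - 1]) + cost[i][j];
          }
      }

      int main(void)
      {
          for (int i = 0; i < M; i++)
              for (int j = 0; j < N; j++)
                  cost[i][j] = (i * 7 + j * 3) % 5;   /* arbitrary demo costs */

          /* Forward pass: keep only rows 0, K, 2K, ... */
          int prev[N] = { 0 }, cur[N];
          for (int i = 0; i < M; i++) {
              fill_row(i, prev, cur);
              if (i % K == 0) memcpy(ckpt[i / K], cur, sizeof cur);
              memcpy(prev, cur, sizeof cur);
          }

          /* Backtrace: rebuild the stripe holding rows base..i on demand. */
          int stripe[K + 1][N];
          int i = M - 1, j = N - 1;
          printf("(%d,%d)", i, j);
          while (i > 0 || j > 0) {
              if (i == 0) { j--; }
              else {
                  int base = ((i - 1) / K) * K;   /* checkpoint at or below i-1 */
                  memcpy(stripe[0], ckpt[base / K], sizeof stripe[0]);
                  for (int r = base + 1; r <= i; r++)
                      fill_row(r, stripe[r - 1 - base], stripe[r - base]);
                  int up   = stripe[i - 1 - base][j];
                  int left = (j > 0) ? stripe[i - base][j - 1] : (1 << 30);
                  if (up <= left) i--; else j--;
              }
              printf(" <- (%d,%d)", i, j);
          }
          printf("\n");
          return 0;
      }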

  8. Advanced Development of Certified OS Kernels

    DTIC Science & Technology

    2015-06-01

    It provides an infrastructure to map a physical page into multiple processes’ page maps in different address spaces. Their ownership mechanism ensures...of their shared memory infrastructure. Trap module: the trap module specifies the behaviors of exception handlers and mCertiKOS system calls. In...layers), 1 pm for the shared memory infrastructure (3 layers), 3.5 pm for the thread management (10 layers), 1 pm for the process management (4 layers

  9. 6 DOF Nonlinear AUV Simulation Toolbox

    DTIC Science & Technology

    1997-01-01

    is to supply a flexible 3D simulation platform for motion visualization, in-lab debugging and testing of mission-specific strategies as well as those...Explorer are modularly designed [Smith] in order to cut the time and cost of vehicle reconfiguration. A flexible 3D simulation platform is desired to... 3D models. Currently implemented modules include a nonlinear dynamic model for the OEX, shared memory and semaphore manager tools, and a shared memory monitor

  10. A cache-aided multiprocessor rollback recovery scheme

    NASA Technical Reports Server (NTRS)

    Wu, Kun-Lung; Fuchs, W. Kent

    1989-01-01

    This paper demonstrates how previous uniprocessor cache-aided recovery schemes can be applied to multiprocessor architectures, for recovering from transient processor failures, utilizing private caches and a global shared memory. As with cache-aided uniprocessor recovery, the multiprocessor cache-aided recovery scheme of this paper can be easily integrated into standard bus-based snoopy cache coherence protocols. A consistent shared memory state is maintained without the necessity of global check-pointing.

  11. Targeted Memory Reactivation during Sleep Adaptively Promotes the Strengthening or Weakening of Overlapping Memories.

    PubMed

    Oyarzún, Javiera P; Morís, Joaquín; Luque, David; de Diego-Balaguer, Ruth; Fuentemilla, Lluís

    2017-08-09

    System memory consolidation is conceptualized as an active process whereby newly encoded memory representations are strengthened through selective memory reactivation during sleep. However, our learning experience is highly overlapping in content (i.e., shares common elements), and memories of these events are organized in an intricate network of overlapping associated events. It remains to be explored whether and how selective memory reactivation during sleep has an impact on these overlapping memories acquired during awake time. Here, we test in a group of adult women and men the prediction that selective memory reactivation during sleep entails the reactivation of associated events and that this may lead the brain to adaptively regulate whether these associated memories are strengthened or pruned from memory networks on the basis of their relative associative strength with the shared element. Our findings demonstrate the existence of efficient regulatory neural mechanisms governing how complex memory networks are shaped during sleep as a function of their associative memory strength. SIGNIFICANCE STATEMENT Numerous studies have demonstrated that system memory consolidation is an active, selective, and sleep-dependent process in which only subsets of new memories become stabilized through their reactivation. However, the learning experience is highly overlapping in content and thus events are encoded in an intricate network of related memories. It remains to be explored whether and how memory reactivation has an impact on overlapping memories acquired during awake time. Here, we show that sleep memory reactivation promotes strengthening and weakening of overlapping memories based on their associative memory strength. These results suggest the existence of an efficient regulatory neural mechanism that avoids the formation of cluttered memory representation of multiple events and promotes stabilization of complex memory networks.

  12. The N400 reveals how personal semantics is processed: Insights into the nature and organization of self-knowledge.

    PubMed

    Coronel, Jason C; Federmeier, Kara D

    2016-04-01

    There is growing recognition that some important forms of long-term memory are difficult to classify into one of the well-studied memory subtypes. One example is personal semantics. Like the episodes that are stored as part of one's autobiography, personal semantics is linked to an individual, yet, like general semantic memory, it is detached from a specific encoding context. Access to general semantics elicits an electrophysiological response known as the N400, which has been characterized across three decades of research; surprisingly, this response has not been fully examined in the context of personal semantics. In this study, we assessed responses to congruent and incongruent statements about people's own, personal preferences. We found that access to personal preferences elicited N400 responses, with congruency effects that were similar in latency and distribution to those for general semantic statements elicited from the same participants. These results suggest that the processing of personal and general semantics share important functional and neurobiological features.

  13. [Artificial intelligence meeting neuropsychology. Semantic memory in normal and pathological aging].

    PubMed

    Aimé, Xavier; Charlet, Jean; Maillet, Didier; Belin, Catherine

    2015-03-01

    Artificial intelligence (AI) is the subject of much research, but also of many fantasies. It aims to reproduce human intelligence in its capacities for learning, knowledge storage, and computation. In 2014, the Defense Advanced Research Projects Agency (DARPA) started the Restoring Active Memory (RAM) program, which attempts to develop implantable technology to bridge gaps in the injured brain and restore normal memory function to people with memory loss caused by injury or disease. In another field of AI, computational ontologies (formal, shared conceptualizations) model knowledge in order to represent the concepts of a target domain in a structured and unambiguous way. The aim of these structures is to ensure a consensual understanding of their meaning and a consistent use (the same concept is used by all to categorize the same individuals). The first knowledge representations in the AI domain were largely based on models of semantic memory. Semantic memory, a component of long-term memory, is the memory of words, ideas, and concepts; it is the only declarative memory system that resists the effects of age so remarkably. In contrast, non-specific cognitive changes may decrease the performance of the elderly on various tasks, reflecting difficulties in accessing semantic representations rather than degradation of the semantic store itself. Some dementias, such as semantic dementia and Alzheimer's disease, are linked to an alteration of semantic memory. Using the computational-ontology model, we propose in this paper a formal and relatively lightweight modeling in the service of neuropsychology: 1) for the practitioner, with decision support systems; 2) for the patient, as an externalized cognitive prosthesis; and 3) for the researcher, to study semantic memory.

  14. A High Performance VLSI Computer Architecture For Computer Graphics

    NASA Astrophysics Data System (ADS)

    Chin, Chi-Yuan; Lin, Wen-Tai

    1988-10-01

    A VLSI computer architecture consisting of multiple processors is presented in this paper to satisfy the demands of modern computer graphics, e.g., high resolution, realistic animation, and real-time display. All processors share a global memory that is partitioned into multiple banks. Through a crossbar network, data from one memory bank can be broadcast to many processors. Processors are physically interconnected through a hyper-crossbar network (a crossbar-like network). By programming the network, the topology of the communication links among processors can be reconfigured to satisfy the specific dataflows of different applications. Each processor consists of a controller, arithmetic operators, local memory, a local crossbar network, and I/O ports to communicate with other processors, memory banks, and a system controller. Operations in each processor are characterized into two modes, object domain and space domain, to fully exploit the data-independency characteristics of graphics processing. Special graphics features such as 3D-to-2D conversion, shadow generation, texturing, and reflection can be easily handled. With current high-density interconnection technology, it is feasible to implement a 64-processor system achieving 2.5 billion operations per second, a performance needed in most advanced graphics applications.

  15. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi

    Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
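
    For readers unfamiliar with XMP's global view, the fragment below gives the flavor. It is a schematic, untested sketch (the directive spellings are assumed here, and the coarray notation is shown only in a comment), not code from the GTC-P port.

      /* Schematic XcalableMP fragment (untested; directive spellings assumed). */
      #include <stdio.h>
      #define NX 1024

      #pragma xmp nodes p(4)
      #pragma xmp template t(0:NX-1)
      #pragma xmp distribute t(block) onto p
      double field[NX];
      #pragma xmp align field[i] with t(i)

      int main(void)
      {
          /* Global view: one logical loop; XMP runs each iteration on the
             node that owns field[i]. */
      #pragma xmp loop on t(i)
          for (int i = 0; i < NX; i++)
              field[i] = (double)i;

          /* Local view: coarray-style one-sided access would instead name the
             remote image explicitly, e.g.  buf = halo[0]:[me + 1];
             (shown only as a comment, since the exact C coarray syntax is an
             XMP extension). */
          return 0;
      }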

  16. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

    DOE PAGES

    Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; ...

    2016-06-01

    Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.

  17. Autobiographical memory functions of nostalgia in comparison to rumination and counterfactual thinking: similarity and uniqueness.

    PubMed

    Cheung, Wing-Yee; Wildschut, Tim; Sedikides, Constantine

    2018-02-01

    We compared and contrasted nostalgia with rumination and counterfactual thinking in terms of their autobiographical memory functions. Specifically, we assessed individual differences in nostalgia, rumination, and counterfactual thinking, which we then linked to self-reported functions or uses of autobiographical memory (Self-Regard, Boredom Reduction, Death Preparation, Intimacy Maintenance, Conversation, Teach/Inform, and Bitterness Revival). We tested which memory functions are shared and which are uniquely linked to nostalgia. The commonality among nostalgia, rumination, and counterfactual thinking resides in their shared positive associations with all memory functions: individuals who evinced a stronger propensity towards past-oriented thought (as manifested in nostalgia, rumination, and counterfactual thinking) reported greater overall recruitment of memories in the service of present functioning. The uniqueness of nostalgia resides in its comparatively strong positive associations with Intimacy Maintenance, Teach/Inform, and Self-Regard and weak association with Bitterness Revival. In all, nostalgia possesses a more positive functional signature than do rumination and counterfactual thinking.

  18. Mnemonic convergence in social networks: The emergent properties of cognition at a collective level.

    PubMed

    Coman, Alin; Momennejad, Ida; Drach, Rae D; Geana, Andra

    2016-07-19

    The development of shared memories, beliefs, and norms is a fundamental characteristic of human communities. These emergent outcomes are thought to occur owing to a dynamic system of information sharing and memory updating, which fundamentally depends on communication. Here we report results on the formation of collective memories in laboratory-created communities. We manipulated conversational network structure in a series of real-time, computer-mediated interactions in fourteen 10-member communities. The results show that mnemonic convergence, measured as the degree of overlap among community members' memories, is influenced by both individual-level information-processing phenomena and by the conversational social network structure created during conversational recall. By studying laboratory-created social networks, we show how large-scale social phenomena (i.e., collective memory) can emerge out of microlevel local dynamics (i.e., mnemonic reinforcement and suppression effects). The social-interactionist approach proposed herein points to optimal strategies for spreading information in social networks and provides a framework for measuring and forging collective memories in communities of individuals.

  19. Using memories to understand others: the role of episodic memory in theory of mind impairment in Alzheimer disease.

    PubMed

    Moreau, Noémie; Viallet, François; Champagne-Lavau, Maud

    2013-09-01

    Theory of mind (TOM) refers to the ability to infer one's own and others' mental states. Growing evidence has highlighted the presence of impairment on the most complex TOM tasks in Alzheimer disease (AD). However, how the TOM deficit is related to other cognitive dysfunctions, and more specifically to episodic memory impairment, the prominent feature of this disease, is still under debate. Recent neuroanatomical findings have shown that remembering past events and inferring others' states of mind engage the same cerebral network, suggesting the two abilities share a common process. This paper reviews emergent evidence of TOM impairment in AD patients and discusses the evidence of a relationship between TOM and episodic memory. We discuss how the AD patients' deficit in TOM may be related to their difficulties in recollecting memories of past social interactions.

  20. Mental time travel and the shaping of the human mind

    PubMed Central

    Suddendorf, Thomas; Addis, Donna Rose; Corballis, Michael C.

    2009-01-01

    Episodic memory, enabling conscious recollection of past episodes, can be distinguished from semantic memory, which stores enduring facts about the world. Episodic memory shares a core neural network with the simulation of future episodes, enabling mental time travel into both the past and the future. The notion that there might be something distinctly human about mental time travel has provoked ingenious attempts to demonstrate episodic memory or future simulation in non-human animals, but we argue that they have not yet established a capacity comparable to the human faculty. The evolution of the capacity to simulate possible future events, based on episodic memory, enhanced fitness by enabling action in preparation of different possible scenarios that increased present or future survival and reproduction chances. Human language may have evolved in the first instance for the sharing of past and planned future events, and, indeed, fictional ones, further enhancing fitness in social settings. PMID:19528013

  1. Upsets in Erased Floating Gate Cells With High-Energy Protons

    DOE PAGES

    Gerardin, S.; Bagatin, M.; Paccagnella, A.; ...

    2017-01-01

    We discuss upsets in erased floating gate cells, due to large threshold voltage shifts, using statistical distributions collected on a large number of memory cells. The spread in the neutral threshold voltage appears to be too low to quantitatively explain the experimental observations in terms of simple charge loss, at least in SLC devices. The possibility that memories exposed to high energy protons and heavy ions exhibit negative charge transfer between programmed and erased cells is investigated, although the analysis does not provide conclusive support to this hypothesis.

  2. Exploring Manycore Multinode Systems for Irregular Applications with FPGA Prototyping

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ceriani, Marco; Palermo, Gianluca; Secchi, Simone

    We present a prototype of a multi-core architecture implemented on FPGA, designed to enable efficient execution of irregular applications on distributed shared memory machines while maintaining high performance on regular workloads. The architecture is composed of off-the-shelf soft cores, local interconnection, and a memory interface, integrated with custom components that optimize it for irregular applications. It relies on three key elements: a global address space, multithreading, and fine-grained synchronization. Global addresses are scrambled to reduce the formation of network hot-spots, while the latency of transactions is covered by integrating a hardware scheduler within the custom load/store buffers to take advantage of the availability of multiple execution threads, increasing efficiency in a way that is transparent to the application. We evaluated a dual-node system on irregular kernels, showing scalability in the number of cores and threads.
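
    The address-scrambling element can be illustrated with a toy hash. The sketch below is an assumption-laden stand-in (the mixing constants are arbitrary, not the prototype's hash): an invertible mix of the block address spreads a strided sweep, which would otherwise hit one of eight nodes repeatedly, across all of them.

      /* Illustrative global-address scrambler (constants are arbitrary). */
      #include <stdint.h>
      #include <stdio.h>

      static uint64_t scramble(uint64_t block_addr)
      {
          uint64_t x = block_addr;
          x ^= x >> 17;
          x *= 0x9E3779B97F4A7C15ULL;   /* odd multiplier: invertible mod 2^64 */
          x ^= x >> 29;
          return x;
      }

      int main(void)
      {
          enum { NODES = 8 };
          int hits[NODES] = { 0 };
          /* A stride-8 sweep maps every block to node 0 without scrambling. */
          for (uint64_t a = 0; a < 64; a += 8)
              hits[scramble(a) % NODES]++;
          for (int n = 0; n < NODES; n++)
              printf("node %d: %d blocks\n", n, hits[n]);
          return 0;
      }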

  3. A Framework for Parallel Unstructured Grid Generation for Complex Aerodynamic Simulations

    NASA Technical Reports Server (NTRS)

    Zagaris, George; Pirzadeh, Shahyar Z.; Chrisochoides, Nikos

    2009-01-01

    A framework for parallel unstructured grid generation targeting both shared memory multi-processors and distributed memory architectures is presented. The two fundamental building blocks of the framework are (1) the Advancing-Partition (AP) method, used for domain decomposition, and (2) the Advancing Front (AF) method, used for mesh generation. Starting from the surface mesh of the computational domain, the AP method is applied recursively to generate a set of sub-domains. Next, the sub-domains are meshed in parallel using the AF method. The recursive nature of the domain decomposition naturally maps to a divide-and-conquer algorithm which exhibits inherent parallelism. For the parallel implementation, the Master/Worker pattern is employed to dynamically balance the varying workloads of each task on the set of available CPUs. Performance results obtained with this approach are presented and discussed in detail, along with future work and improvements.

  4. Schemas in Problem Solving: An Integrated Model of Learning, Memory, and Instruction

    DTIC Science & Technology

    1992-01-01

    article: "Hybrid Computation in Cognitive Science: Neural Networks and Symbols" (J. A. Anderson, 1990). And, Marvin Minsky echoes the sentiment in his...distributed processing: A handbook of models, programs, and exercises. Cambridge, MA: The MIT Press. Minsky, M. (1991). Logical versus analogical or symbolic

  5. Cricket: A Mapped, Persistent Object Store

    NASA Technical Reports Server (NTRS)

    Shekita, Eugene; Zwilling, Michael

    1996-01-01

    This paper describes Cricket, a new database storage system that is intended to be used as a platform for design environments and persistent programming languages. Cricket uses the memory management primitives of the Mach operating system to provide the abstraction of a shared, transactional single-level store that can be directly accessed by user applications. In this paper, we present the design and motivation for Cricket. We also present some initial performance results which show that, for its intended applications, Cricket can provide better performance than a general-purpose database storage system.

  6. The NGEE Arctic Data Archive -- Portal for Archiving and Distributing Data and Documentation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Boden, Thomas A; Palanisamy, Giri; Devarakonda, Ranjeet

    2014-01-01

    The Next-Generation Ecosystem Experiments (NGEE Arctic) project is committed to implementing a rigorous and high-quality data management program. The goal is to implement innovative and cost-effective guidelines and tools for collecting, archiving, and sharing data within the project, the larger scientific community, and the public. The NGEE Arctic web site is the framework for implementing these data management and data sharing tools. The open sharing of NGEE Arctic data among project researchers, the broader scientific community, and the public is critical to meeting the scientific goals and objectives of the NGEE Arctic project and critical to advancing the mission of the Department of Energy (DOE), Office of Science, Biological and Environmental (BER) Terrestrial Ecosystem Science (TES) program.

  7. Modeling the glass transition of amorphous networks for shape-memory behavior

    NASA Astrophysics Data System (ADS)

    Xiao, Rui; Choi, Jinwoo; Lakhera, Nishant; Yakacki, Christopher M.; Frick, Carl P.; Nguyen, Thao D.

    2013-07-01

    In this paper, a thermomechanical constitutive model was developed for the time-dependent behaviors of the glass transition of amorphous networks. The model used multiple discrete relaxation processes to describe the distribution of relaxation times for stress relaxation, structural relaxation, and stress-activated viscous flow. A non-equilibrium thermodynamic framework based on the fictive temperature was introduced to demonstrate the thermodynamic consistency of the constitutive theory. Experimental and theoretical methods were developed to determine the parameters describing the distribution of stress and structural relaxation times and the dependence of the relaxation times on temperature, structure, and driving stress. The model was applied to study the effects of deformation temperatures and physical aging on the shape-memory behavior of amorphous networks. The model was able to reproduce important features of the partially constrained recovery response observed in experiments. Specifically, the model demonstrated a strain-recovery overshoot for cases programmed below Tg and subjected to a constant mechanical load. This phenomenon was not observed for materials programmed above Tg. Physical aging, in which the material was annealed for an extended period of time below Tg, shifted the activation of strain recovery to higher temperatures and increased significantly the initial recovery rate. For fixed-strain recovery, the model showed a larger overshoot in the stress response for cases programmed below Tg, which was consistent with previous experimental observations. Altogether, this work demonstrates how an understanding of the time-dependent behaviors of the glass transition can be used to tailor the temperature and deformation history of the shape-memory programming process to achieve more complex shape recovery pathways, faster recovery responses, and larger activation stresses.

  8. A general purpose subroutine for fast fourier transform on a distributed memory parallel machine

    NASA Technical Reports Server (NTRS)

    Dubey, A.; Zubair, M.; Grosch, C. E.

    1992-01-01

    One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.
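
    The redistribution step such a general-purpose FFT needs can be sketched with message passing. The fragment below assumes a simple layout and is not the paper's iPSC/860 implementation: it converts a block-row distribution of an N x N array into a block-column one with a single all-to-all exchange.

      /* Block-row to block-column redistribution sketch (assumed layout). */
      #include <mpi.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          const int n = 8 * np;          /* global N, divisible by np */
          const int rows = n / np;       /* my block of rows (before) */
          const int cols = n / np;       /* my block of columns (after) */
          double *a  = malloc((size_t)rows * n * sizeof *a);
          double *pk = malloc((size_t)rows * n * sizeof *pk);
          double *at = malloc((size_t)rows * n * sizeof *at);
          for (int i = 0; i < rows * n; i++) a[i] = rank;

          /* Pack: for each destination p, the rows x cols sub-block of the
             columns that p will own after the redistribution. */
          for (int p = 0; p < np; p++)
              for (int i = 0; i < rows; i++)
                  for (int j = 0; j < cols; j++)
                      pk[(p * rows + i) * cols + j] = a[i * n + p * cols + j];

          /* One collective exchange completes the distribution change. */
          MPI_Alltoall(pk, rows * cols, MPI_DOUBLE,
                       at, rows * cols, MPI_DOUBLE, MPI_COMM_WORLD);
          /* 'at' now holds this rank's n/np columns, grouped by source rank;
             a local reorder yields whatever storage order the FFT expects. */

          free(a); free(pk); free(at);
          MPI_Finalize();
          return 0;
      }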

  9. Intergenerational transmission of historical memories and social-distance attitudes in post-war second-generation Croatians.

    PubMed

    Svob, Connie; Brown, Norman R; Takšić, Vladimir; Katulić, Katarina; Žauhar, Valnea

    2016-08-01

    Intergenerational transmission of memory is a process by which biographical knowledge contributes to the construction of collective memory (representation of a shared past). We investigated the intergenerational transmission of war-related memories and social-distance attitudes in second-generation post-war Croatians. We compared 2 groups of young adults from (1) Eastern Croatia (extensively affected by the war) and (2) Western Croatia (affected relatively less by the war). Participants were asked to (a) recall the 10 most important events that occurred in one of their parents' lives, (b) estimate the calendar years of each, and (c) provide scale ratings on them. Additionally, (d) all participants completed a modified Bogardus Social Distance scale, as well as an (e) War Events Checklist for their parents' lives. There were several findings. First, approximately two-thirds of Eastern Croatians and one-half of Western Croatians reported war-related events from their parents' lives. Second, war-related memories impacted the second-generation's identity to a greater extent than did non-war-related memories; this effect was significantly greater in Eastern Croatians than in Western Croatians. Third, war-related events displayed markedly different mnemonic characteristics than non-war-related events. Fourth, the temporal distribution of events surrounding the war produced an upheaval bump, suggesting major transitions (e.g., war) contribute to the way collective memory is formed. And, finally, outright social ostracism and aggression toward out-groups were rarely expressed, independent of region. Nonetheless, social-distance scores were notably higher in Eastern Croatia than in Western Croatia.

  10. A shared neural ensemble links distinct contextual memories encoded close in time

    NASA Astrophysics Data System (ADS)

    Cai, Denise J.; Aharoni, Daniel; Shuman, Tristan; Shobe, Justin; Biane, Jeremy; Song, Weilin; Wei, Brandon; Veshkini, Michael; La-Vu, Mimi; Lou, Jerry; Flores, Sergio E.; Kim, Isaac; Sano, Yoshitake; Zhou, Miou; Baumgaertel, Karsten; Lavi, Ayal; Kamata, Masakazu; Tuszynski, Mark; Mayford, Mark; Golshani, Peyman; Silva, Alcino J.

    2016-06-01

    Recent studies suggest that a shared neural ensemble may link distinct memories encoded close in time. According to the memory allocation hypothesis, learning triggers a temporary increase in neuronal excitability that biases the representation of a subsequent memory to the neuronal ensemble encoding the first memory, such that recall of one memory increases the likelihood of recalling the other memory. Here we show in mice that the overlap between the hippocampal CA1 ensembles activated by two distinct contexts acquired within a day is higher than when they are separated by a week. Several findings indicate that this overlap of neuronal ensembles links two contextual memories. First, fear paired with one context is transferred to a neutral context when the two contexts are acquired within a day but not across a week. Second, the first memory strengthens the second memory within a day but not across a week. Older mice, known to have lower CA1 excitability, do not show the overlap between ensembles, the transfer of fear between contexts, or the strengthening of the second memory. Finally, in aged mice, increasing cellular excitability and activating a common ensemble of CA1 neurons during two distinct context exposures rescued the deficit in linking memories. Taken together, these findings demonstrate that contextual memories encoded close in time are linked by directing storage into overlapping ensembles. Alteration of these processes by ageing could affect the temporal structure of memories, thus impairing efficient recall of related information.

  11. Factor structure of overall autobiographical memory usage: the directive, self and social functions revisited.

    PubMed

    Rasmussen, Anne S; Habermas, Tilmann

    2011-08-01

    According to theory, autobiographical memory serves three broad functions of overall usage: directive, self, and social. However, there is evidence to suggest that the tripartite model may be better conceptualised in terms of a four-factor model with two social functions. In the present study we examined the two models in Danish and German samples, using the Thinking About Life Experiences Questionnaire (TALE; Bluck, Alea, Habermas, & Rubin, 2005), which measures the overall usage of the three functions generalised across concrete memories. Confirmatory factor analysis supported the four-factor model and rejected the theoretical three-factor model in both samples. The results are discussed in relation to cultural differences in overall autobiographical memory usage as well as sharing versus non-sharing aspects of social remembering.

  12. The application of a sparse, distributed memory to the detection, identification and manipulation of physical objects

    NASA Technical Reports Server (NTRS)

    Kanerva, P.

    1986-01-01

    To determine the relation of the sparse, distributed memory to other architectures, a broad review of the literature was made. The memory is called a pattern memory because it works with large patterns of features (high-dimensional vectors). A pattern is stored in a pattern memory by distributing it over a large number of storage elements and by superimposing it over other stored patterns. A pattern is retrieved by mathematical or statistical reconstruction from the distributed elements. Three pattern memories are discussed.
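
    The store-and-reconstruct behavior can be made concrete with a toy autoassociative version of Kanerva's memory. In the sketch below, all sizes, the Hamming radius, and the random hard-location addresses are illustrative choices: a write updates counters at every hard location within the radius of the write address, and a read sums and thresholds those counters.

      /* Toy sparse distributed memory (illustrative sizes; GCC/Clang builtin). */
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define BITS 256
      #define WORDS (BITS / 64)
      #define LOCS 2000
      #define RADIUS 112

      typedef struct { uint64_t w[WORDS]; } vec;

      static vec loc_addr[LOCS];
      static int counters[LOCS][BITS];

      static int hamming(const vec *a, const vec *b)
      {
          int d = 0;
          for (int i = 0; i < WORDS; i++)
              d += __builtin_popcountll(a->w[i] ^ b->w[i]);
          return d;
      }

      static int bit(const vec *v, int i) { return (v->w[i / 64] >> (i % 64)) & 1; }

      static void sdm_write(const vec *addr, const vec *data)
      {
          for (int l = 0; l < LOCS; l++)
              if (hamming(addr, &loc_addr[l]) <= RADIUS)      /* location fires */
                  for (int i = 0; i < BITS; i++)
                      counters[l][i] += bit(data, i) ? 1 : -1;
      }

      static vec sdm_read(const vec *addr)
      {
          int sum[BITS] = { 0 };
          vec out = { { 0 } };
          for (int l = 0; l < LOCS; l++)
              if (hamming(addr, &loc_addr[l]) <= RADIUS)
                  for (int i = 0; i < BITS; i++)
                      sum[i] += counters[l][i];
          for (int i = 0; i < BITS; i++)          /* majority vote per bit */
              if (sum[i] > 0) out.w[i / 64] |= 1ULL << (i % 64);
          return out;
      }

      int main(void)
      {
          srand(1);
          for (int l = 0; l < LOCS; l++)
              for (int i = 0; i < WORDS; i++)
                  loc_addr[l].w[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();

          vec p;
          for (int i = 0; i < WORDS; i++)
              p.w[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
          sdm_write(&p, &p);                      /* autoassociative store */
          vec r = sdm_read(&p);
          printf("recall distance: %d bits\n", hamming(&p, &r));
          return 0;
      }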

  13. Distributed computing for membrane-based modeling of action potential propagation.

    PubMed

    Porras, D; Rogers, J M; Smith, W M; Pollard, A E

    2000-08-01

    Action potential propagation simulations with physiologic membrane currents and macroscopic tissue dimensions are computationally expensive. We, therefore, analyzed distributed computing schemes to reduce execution time in workstation clusters by parallelizing solutions with message passing. Four schemes were considered in two-dimensional monodomain simulations with the Beeler-Reuter membrane equations. Parallel speedups measured with each scheme were compared to theoretical speedups, recognizing the relationship between speedup and code portions that executed serially. A data decomposition scheme based on total ionic current provided the best performance. Analysis of communication latencies in that scheme led to a load-balancing algorithm in which measured speedups at 89 +/- 2% and 75 +/- 8% of theoretical speedups were achieved in homogeneous and heterogeneous clusters of workstations. Speedups in this scheme with the Luo-Rudy dynamic membrane equations exceeded 3.0 with eight distributed workstations. Cluster speedups were comparable to those measured during parallel execution on a shared memory machine.

  14. Experimental evaluation of multiprocessor cache-based error recovery

    NASA Technical Reports Server (NTRS)

    Janssens, Bob; Fuchs, W. K.

    1991-01-01

    Several variations of cache-based checkpointing for rollback error recovery in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, differing in the manner in which they avoid rollback propagation, are evaluated. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, the performance effects of integrating the recovery schemes into the cache coherence protocol are evaluated. The results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead but uncontrollably high variability in the checkpoint interval.

  15. Performing a local reduction operation on a parallel computer

    DOEpatents

    Blocksome, Michael A; Faraj, Daniel A

    2013-06-04

    A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

  16. Performing a local reduction operation on a parallel computer

    DOEpatents

    Blocksome, Michael A.; Faraj, Daniel A.

    2012-12-11

    A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.
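
    The copy-then-reduce pattern in these claims can be approximated in ordinary threaded code. The sketch below is a rough, hypothetical rendering (chunk size, buffer layout, and the use of pthreads are invented for illustration; the patent targets dedicated cores on a compute node): two threads interleave chunks of their input buffers into a shared buffer, and each then reduces every other chunk.

      /* Hypothetical sketch of interleaved copy followed by local reduction. */
      #include <pthread.h>
      #include <stdio.h>

      #define N 1024
      #define CHUNK 64
      #define NCHUNKS (N / CHUNK)

      static int in[2][N];               /* the two reduction cores' inputs */
      static int interleaved[2 * N];     /* shared interleaved buffer */
      static long partial[2];

      static void *interleave_copy(void *arg)
      {
          int me = *(int *)arg;
          /* My chunk c lands at interleaved slot 2*c + me. */
          for (int c = 0; c < NCHUNKS; c++)
              for (int k = 0; k < CHUNK; k++)
                  interleaved[(2 * c + me) * CHUNK + k] = in[me][c * CHUNK + k];
          return NULL;
      }

      static void *reduce_half(void *arg)
      {
          int me = *(int *)arg;
          long s = 0;
          /* Each thread reduces every other interleaved chunk. */
          for (int c = me; c < 2 * NCHUNKS; c += 2)
              for (int k = 0; k < CHUNK; k++)
                  s += interleaved[c * CHUNK + k];
          partial[me] = s;
          return NULL;
      }

      int main(void)
      {
          for (int i = 0; i < N; i++) { in[0][i] = 1; in[1][i] = 2; }
          pthread_t t[2];
          int id[2] = { 0, 1 };
          for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, interleave_copy, &id[i]);
          for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);   /* acts as the barrier */
          for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, reduce_half, &id[i]);
          for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
          printf("sum = %ld (expect %d)\n", partial[0] + partial[1], 3 * N);
          return 0;
      }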

  17. Improving security of the ping-pong protocol

    NASA Astrophysics Data System (ADS)

    Zawadzki, Piotr

    2013-01-01

    A security layer for the asymptotically secure ping-pong protocol is proposed and analyzed in the paper. The operation of the improvement exploits inevitable errors introduced by the eavesdropping in the control and message modes. Its role is similar to the privacy amplification algorithms known from the quantum key distribution schemes. Messages are processed in blocks which guarantees that an eavesdropper is faced with a computationally infeasible problem as long as the system parameters are within reasonable limits. The introduced additional information preprocessing does not require quantum memory registers and confidential communication is possible without prior key agreement or some shared secret.

  18. Memory loss

    MedlinePlus

    MedlinePlus Medical Encyclopedia entry: //medlineplus.gov/ency/article/003257.htm

  19. SCELib2: the new revision of SCELib, the parallel computational library of molecular properties in the single center approach

    NASA Astrophysics Data System (ADS)

    Sanna, N.; Morelli, G.

    2004-09-01

    In this paper we present the new version of the SCELib program (CPC Catalogue identifier ADMG), a full numerical implementation of the Single Center Expansion (SCE) method. The physics involved is that of producing the SCE description of molecular electronic densities, of molecular electrostatic potentials, and of molecular perturbed potentials due to a point negative or positive charge. This new revision of the program has been optimized to run in serial as well as in parallel execution mode, to support a larger set of molecular symmetries, and to permit the restart of long-lasting calculations. To measure the performance of this new release, a comparative study has been carried out on the most powerful computing architectures in serial and parallel runs. The results of the calculations reported in this paper refer to realistic medium-to-large molecular systems, and they are reported in full detail to best benchmark the parallel architectures the new SCELib2 code will run on. Program summary: Title of program: SCELib2. Catalogue identifier: ADGU. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADGU. Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland. Reference to previous versions: Comput. Phys. Commun. 128 (2) (2000) 139 (CPC catalogue identifier: ADMG). Does the new version supersede the original program?: Yes. Computer for which the program is designed and others on which it has been tested: HP ES45 and rx2600, SUN ES4500, IBM SP, and any single-CPU workstation based on Alpha, SPARC, POWER, Itanium2 and X86 processors. Installations: CASPUR, local. Operating systems under which the program has been tested: HP Tru64 V5.X, SUNOS V5.8, IBM AIX V5.X, Linux RedHat V8.0. Programming language used: C. Memory required to execute with typical data: 10 Mwords, up to 2000 Mwords depending on the molecular system and runtime parameters. No. of bits in a word: 64. No. of processors used: 1 to 32. Has the code been vectorized or parallelized?: Yes. No. of bytes in distributed program, including test data, etc.: 3 798 507. No. of lines in distributed program, including test data, etc.: 187 226. Distribution format: tar.gz. Nature of physical problem: in this set of codes an efficient procedure is implemented to describe the wavefunction and related molecular properties of a polyatomic molecular system within the Single Center Expansion (SCE) approximation. The resulting SCE wavefunction, electron density, and electrostatic and exchange/correlation potentials can then be used, via a proper Application Programming Interface (API), to describe the target molecular system in electron-molecule scattering calculations. The molecular properties expanded over a single center turn out to be of more general application, and some possible uses in quantum chemistry, biomodelling, and drug design are also outlined. Method of solution: the polycentric Hartree-Fock solution for a molecule of arbitrary geometry, based on a linear combination of Gaussian-Type Orbitals (GTOs), is expanded over a single center, typically the Center Of Mass (C.O.M.), by means of a Gauss-Legendre/Chebyshev quadrature over the θ, φ angular coordinates. The resulting SCE numerical wavefunction is then used to calculate the one-particle electron density, the electrostatic potential, and two different models for the correlation/polarization potentials induced by the impinging electron, which have the correct asymptotic behaviour for the leading dipole molecular polarizabilities.
Restrictions on the complexity of the problem: depending on the molecular system under study and on the operating conditions, the program may or may not fit into the available RAM. In the latter case, a feature of the program is to memory-map a disk file in order to access the data efficiently through the disk device. Typical running time: the execution time strongly depends on the molecular target description and on the hardware/OS chosen; it is directly proportional to the (r, θ, φ) grid size and to the number of angular basis functions used. Thus, from the program printout of the main arrays' memory occupancy, the user can approximately derive the expected computer time needed for a given calculation executed in serial mode. For parallel executions the overall efficiency must also be taken into account, and this depends on the number of processors used as well as on the parallel architecture chosen, so a simple general law cannot at present be given. Unusual features of the program: the code has been engineered to use dynamic, runtime-determined global parameters with the aim of fitting all data in RAM. Some unusual circumstances, e.g., large values of those parameters, may cause the program to run with unexpected performance reductions due to runtime bottlenecks such as memory swap operations, which strongly depend on the hardware used. In such cases, a parallel execution of the code is generally sufficient to fix the problem, since the data size is partitioned over the available processors. When a suitable parallel system is not available for execution, a memory-mapped file mechanism can be used; with this option on, all the available memory is used as a buffer for a disk file that contains the whole data set, giving better throughput than the traditional swapping/paging of the Unix OS.
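
    The memory-mapped-file fallback described above corresponds to a standard POSIX pattern. The sketch below (file name and size are illustrative) backs a large array with a disk file and accesses it through a pointer, letting the operating system page data in and out instead of relying on swap.

      /* POSIX memory-mapped file as a RAM-overflow buffer (illustrative). */
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)
      {
          const size_t n = 1 << 20;                /* 1M doubles, ~8 MB */
          int fd = open("scratch.dat", O_RDWR | O_CREAT, 0600);
          if (fd < 0 || ftruncate(fd, n * sizeof(double)) != 0) return 1;

          double *buf = mmap(NULL, n * sizeof(double),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (buf == MAP_FAILED) return 1;

          for (size_t i = 0; i < n; i++)           /* used like ordinary memory */
              buf[i] = (double)i;
          printf("buf[42] = %g\n", buf[42]);

          munmap(buf, n * sizeof(double));
          close(fd);
          return 0;
      }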

  20. RACER: Effective Race Detection Using AspectJ

    NASA Technical Reports Server (NTRS)

    Bodden, Eric; Havelund, Klaus

    2008-01-01

    Programming errors occur frequently in large software systems, and even more so if these systems are concurrent. In the past, researchers have developed specialized programs to aid programmers in detecting concurrent programming errors such as deadlocks, livelocks, starvation, and data races. In this work we propose a language extension to the aspect-oriented programming language AspectJ, in the form of three new built-in pointcuts, lock(), unlock() and maybeShared(), which allow programmers to monitor program events where locks are granted or handed back, and where values are accessed that may be shared amongst multiple Java threads. We decide thread-locality using a static thread-local objects analysis developed by others. Using the three new primitive pointcuts, researchers can directly implement efficient monitoring algorithms to detect concurrent programming errors online. As an example, we present a new algorithm, which we call RACER, an adaptation of the well-known ERASER algorithm to the memory model of Java. We implemented the new pointcuts as an extension to the AspectBench Compiler, implemented the RACER algorithm using this language extension, and then applied the algorithm to the NASA K9 Rover Executive. Our experiments showed the implementation to be very effective: in the Rover Executive, RACER finds 70 data races, only one of which was previously known. We further applied the algorithm to two other multi-threaded programs written by Computer Science researchers, in which we found races as well.
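
    The lockset discipline at the heart of ERASER-style detectors is compact enough to sketch directly. The fragment below is illustrative C, not RACER itself (RACER adapts the idea to the Java memory model and gathers events via the new pointcuts): each shared variable keeps the intersection of the lock sets held at its accesses, and an empty intersection flags a potential race.

      /* Compact lockset (ERASER-style) race-detection sketch. */
      #include <stdio.h>

      typedef unsigned lockset;            /* bit i set = lock i held */

      struct shadow {
          lockset candidates;              /* locks consistently protecting the variable */
          int     accessed;
      };

      static void on_access(struct shadow *s, lockset held, const char *what)
      {
          if (!s->accessed) { s->candidates = held; s->accessed = 1; }
          else              s->candidates &= held;   /* refine by intersection */
          if (s->candidates == 0)
              printf("potential race on %s\n", what);
      }

      int main(void)
      {
          struct shadow x = { 0, 0 };
          on_access(&x, 0x1, "x");   /* thread A holds lock 0 */
          on_access(&x, 0x3, "x");   /* thread B holds locks 0 and 1: still ok */
          on_access(&x, 0x2, "x");   /* thread C holds only lock 1: race! */
          return 0;
      }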

  1. ELAS: A general-purpose computer program for the equilibrium problems of linear structures. Volume 2: Documentation of the program. [subroutines and flow charts]

    NASA Technical Reports Server (NTRS)

    Utku, S.

    1969-01-01

    A general-purpose digital computer program for the in-core solution of linear equilibrium problems of structural mechanics is documented. The program requires minimal input to describe the problem. The solution is obtained by means of the displacement method and the finite element technique. Almost any geometry and structure may be handled because of the availability of linear, triangular, quadrilateral, tetrahedral, hexahedral, conical, triangular torus, and quadrilateral torus elements. The assumption of piecewise linear deflection distribution ensures monotonic convergence of the deflections from the stiffer side with decreasing mesh size. Stresses are obtained from the strain tensors that best fit, in the least-squares sense, the deflections computed at the mesh points. The selection of local coordinate systems, whenever necessary, is automatic. Core memory is used economically by means of dynamic memory allocation, an optional mesh-point relabelling scheme, and imposition of the boundary conditions during assembly.
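
    To make the displacement method concrete, here is a hedged toy sketch, not ELAS itself: a uniform 1-D bar with unit element stiffness is assembled into a global stiffness matrix, one end is fixed, and K u = f is solved for nodal displacements.

    /* Displacement-method toy: assemble 1-D bar elements, solve K u = f. */
    #include <stdio.h>

    #define N 5                       /* number of nodes                   */

    int main(void) {
        double K[N][N] = {{0}}, f[N] = {0}, u[N];
        double k_e = 1.0;             /* element stiffness EA/L, set to 1  */

        /* Assembly: each element [i, i+1] adds k_e * [1 -1; -1 1]. */
        for (int e = 0; e < N - 1; e++) {
            K[e][e]     += k_e;  K[e][e+1]   -= k_e;
            K[e+1][e]   -= k_e;  K[e+1][e+1] += k_e;
        }
        f[N-1] = 1.0;                 /* unit axial load at the free end   */

        /* Impose u[0] = 0 by replacing the first equation. */
        for (int j = 0; j < N; j++) K[0][j] = 0.0;
        K[0][0] = 1.0; f[0] = 0.0;

        /* Gaussian elimination (no pivoting; fine for this toy system). */
        for (int p = 0; p < N - 1; p++)
            for (int i = p + 1; i < N; i++) {
                double m = K[i][p] / K[p][p];
                for (int j = p; j < N; j++) K[i][j] -= m * K[p][j];
                f[i] -= m * f[p];
            }
        for (int i = N - 1; i >= 0; i--) {
            u[i] = f[i];
            for (int j = i + 1; j < N; j++) u[i] -= K[i][j] * u[j];
            u[i] /= K[i][i];
        }

        for (int i = 0; i < N; i++)   /* expect u = 0, 1, 2, 3, 4 */
            printf("u[%d] = %g\n", i, u[i]);
        return 0;
    }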

  2. We Remember, We Forget: Collaborative Remembering in Older Couples

    ERIC Educational Resources Information Center

    Harris, Celia B.; Keil, Paul G.; Sutton, John; Barnier, Amanda J.; McIlwain, Doris J. F.

    2011-01-01

    Transactive memory theory describes the processes by which benefits for memory can occur when remembering is shared in dyads or groups. In contrast, cognitive psychology experiments demonstrate that social influences on memory disrupt and inhibit individual recall. However, most research in cognitive psychology has focused on groups of strangers…

  3. 76 FR 12821 - 150th Anniversary of the Inauguration of Abraham Lincoln

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-03-09

    ... together by shared memories and common hopes. As we observe the 150th anniversary of his Inauguration, we... his memory enabled America to move beyond a young collection of States to become a free and unified... memory and uphold the principles he so nobly advanced.

  4. Expert Systems on Multiprocessor Architectures. Volume 2. Technical Reports

    DTIC Science & Technology

    1991-06-01

    Report RC 12936 (#58037), IBM T. J. Watson Research Center, July 1987. Alan Jay Smith, "Cache Memories," Computing Surveys, 14(3):473-530... basic-shared is an instrument for a shared-memory design. The component panels are processor-qload-scrolling-bar-panel, memory-qload-scrolling-bar-panel

  5. Blanket Gate Would Address Blocks Of Memory

    NASA Technical Reports Server (NTRS)

    Lambe, John; Moopenn, Alexander; Thakoor, Anilkumar P.

    1988-01-01

    Circuit-chip area used more efficiently. Proposed gate structure selectively allows and restricts access to blocks of memory in electronic neural-type network. By breaking memory into independent blocks, gate greatly simplifies problem of reading from and writing to memory. Since blocks are not used simultaneously, they share operational amplifiers that prompt and read information stored in memory cells. Fewer operational amplifiers needed, and chip area occupied reduced correspondingly. Cost per bit drops as result.

  6. The potential of multi-port optical memories in digital computing

    NASA Technical Reports Server (NTRS)

    Alford, C. O.; Gaylord, T. K.

    1975-01-01

    A high-capacity memory with a relatively high data transfer rate and multi-port simultaneous access capability may serve as the basis for new computer architectures. The implementation of a multi-port optical memory is discussed. Several computer structures are presented that might profitably use such a memory. These structures include (1) a simultaneous record access system, (2) a simultaneously shared memory computer system, and (3) a parallel digital processing structure.

  7. Fortran programs for the time-dependent Gross-Pitaevskii equation in a fully anisotropic trap

    NASA Astrophysics Data System (ADS)

    Muruganandam, P.; Adhikari, S. K.

    2009-10-01

    Here we develop simple numerical algorithms for both stationary and non-stationary solutions of the time-dependent Gross-Pitaevskii (GP) equation describing the properties of Bose-Einstein condensates at ultra-low temperatures. In particular, we consider algorithms involving real- and imaginary-time propagation based on a split-step Crank-Nicolson method. In the one-space-variable form of the GP equation we consider the one-dimensional, two-dimensional circularly-symmetric, and three-dimensional spherically-symmetric harmonic-oscillator traps. In the two-space-variable form we consider the GP equation in two-dimensional anisotropic and three-dimensional axially-symmetric traps. The fully-anisotropic three-dimensional GP equation is also considered. Numerical results for the chemical potential and root-mean-square size of stationary states are reported using imaginary-time propagation programs for all the cases and compared with previously obtained results. Also presented are numerical results of non-stationary oscillation for different trap symmetries using real-time propagation programs. A set of convenient working codes developed in Fortran 77 is also provided for all these cases (twelve programs in all). In the case of two or three space variables, Fortran 90/95 versions provide some simplification over the Fortran 77 programs, and these programs are also included (six programs in all). Program summary: Program title: (i) imagtime1d, (ii) imagtime2d, (iii) imagtime3d, (iv) imagtimecir, (v) imagtimesph, (vi) imagtimeaxial, (vii) realtime1d, (viii) realtime2d, (ix) realtime3d, (x) realtimecir, (xi) realtimesph, (xii) realtimeaxial. Catalogue identifier: AEDU_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDU_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 122 907. No. of bytes in distributed program, including test data, etc.: 609 662. Distribution format: tar.gz. Programming language: Fortran 77 and Fortran 90/95. Computer: PC. Operating system: Linux, Unix. RAM: 1 GByte (i, iv, v), 2 GByte (ii, vi, vii, x, xi), 4 GByte (iii, viii, xii), 8 GByte (ix). Classification: 2.9, 4.3, 4.12. Nature of problem: These programs are designed to solve the time-dependent Gross-Pitaevskii nonlinear partial differential equation in one, two or three space dimensions with a harmonic, circularly-symmetric, spherically-symmetric, axially-symmetric or anisotropic trap. The Gross-Pitaevskii equation describes the properties of a dilute trapped Bose-Einstein condensate. Solution method: The time-dependent Gross-Pitaevskii equation is solved by the split-step Crank-Nicolson method by discretizing in space and time. The discretized equation is then solved by propagation, in either imaginary or real time, over small time steps. The method yields the solution of stationary and/or non-stationary problems. Additional comments: This package consists of 12 programs; see "Program title" above. Fortran 77 versions are provided for each of the 12 and, in addition, Fortran 90/95 versions are included for (ii), (iii), (vi), (viii), (ix), (xii). For the particular purpose of each program, see the summaries below. Running time: Minutes on a medium PC (i, iv, v, vii, x, xi), a few hours on a medium PC (ii, vi, viii, xii), days on a medium PC (iii, ix).
Program summaries (1)-(12): All twelve programs share the following details. Catalogue identifier and program summary URL: see the main program summary above. Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland. Distribution format: tar.gz (electronic file: <program name>.tar.gz). Computers: PC/Linux, workstation/UNIX. Unusual features: none. Nature of physical problem: each program solves the time-dependent Gross-Pitaevskii nonlinear partial differential equation, which describes the properties of a dilute trapped Bose-Einstein condensate, for the geometry and trap listed below. Method of solution: the equation is solved by the split-step Crank-Nicolson method, discretizing in space and time and propagating the discretized equation over small time steps; imaginary-time propagation yields stationary solutions, while real-time propagation yields both stationary and non-stationary solutions (a minimal sketch of the scheme follows the list).
(1) imagtime1d.F: one space dimension, harmonic trap; imaginary-time propagation; maximum RAM: 1 GByte; Fortran 77; typical running time: minutes on a medium PC.
(2) imagtimecir.F: two space dimensions, circularly-symmetric trap; imaginary time; 1 GByte; Fortran 77; minutes on a medium PC.
(3) imagtimesph.F: three space dimensions, spherically-symmetric trap; imaginary time; 1 GByte; Fortran 77; minutes on a medium PC.
(4) realtime1d.F: one space dimension, harmonic trap; real-time propagation; 2 GByte; Fortran 77; minutes on a medium PC.
(5) realtimecir.F: two space dimensions, circularly-symmetric trap; real time; 2 GByte; Fortran 77; minutes on a medium PC.
(6) realtimesph.F: three space dimensions, spherically-symmetric trap; real time; 2 GByte; Fortran 77; minutes on a medium PC.
(7) imagtimeaxial.F and imagtimeaxial.f90: three space dimensions, axially-symmetric trap; imaginary time; 2 GByte; Fortran 77 and Fortran 90; a few hours on a medium PC.
(8) imagtime2d.F and imagtime2d.f90: two space dimensions, anisotropic trap; imaginary time; 2 GByte; Fortran 77 and Fortran 90; a few hours on a medium PC.
(9) realtimeaxial.F and realtimeaxial.f90: three space dimensions, axially-symmetric trap; real time; 4 GByte; Fortran 77 and Fortran 90; hours on a medium PC.
(10) realtime2d.F and realtime2d.f90: two space dimensions, anisotropic trap; real time; 4 GByte; Fortran 77 and Fortran 90; hours on a medium PC.
(11) imagtime3d.F and imagtime3d.f90: three space dimensions, anisotropic trap; imaginary time; 4 GByte; Fortran 77 and Fortran 90; a few days on a medium PC.
(12) realtime3d.F and realtime3d.f90: three space dimensions, anisotropic trap; real time; 8 GByte; Fortran 77 and Fortran 90; days on a medium PC.
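
    The sketch referenced above: a hedged toy C translation of the 1-D imaginary-time case (the distributed codes are Fortran; grid, time step, and nonlinearity here are illustrative). Each step applies the potential and nonlinear term as an exponential half-step, advances the kinetic term by a Crank-Nicolson tridiagonal solve, applies a second half-step, and renormalizes the wave function.

    /* 1-D GP equation, imaginary-time split-step Crank-Nicolson (toy). */
    #include <math.h>
    #include <stdio.h>

    #define N 601

    int main(void) {
        double dx = 0.025, dt = 1e-4, g = 10.0;  /* grid, step, nonlinearity */
        double x[N], psi[N], V[N];
        double a[N], b[N], c[N], r[N];           /* tridiagonal system       */

        for (int i = 0; i < N; i++) {            /* Gaussian initial guess   */
            x[i] = (i - N / 2) * dx;
            V[i] = 0.5 * x[i] * x[i];            /* harmonic trap            */
            psi[i] = exp(-x[i] * x[i] / 2.0);
        }

        double alpha = dt / (2.0 * dx * dx);     /* CN kinetic coefficient   */
        for (int step = 0; step < 20000; step++) {
            /* half-step: potential + nonlinearity, exact exponential */
            for (int i = 0; i < N; i++)
                psi[i] *= exp(-0.5 * dt * (V[i] + g * psi[i] * psi[i]));

            /* Crank-Nicolson kinetic step, solved by the Thomas algorithm */
            for (int i = 0; i < N; i++) {
                a[i] = c[i] = -alpha / 2.0;
                b[i] = 1.0 + alpha;
                double lap = (i > 0 ? psi[i-1] : 0) - 2 * psi[i]
                           + (i < N-1 ? psi[i+1] : 0);
                r[i] = psi[i] + 0.5 * alpha * lap;
            }
            for (int i = 1; i < N; i++) {        /* forward sweep */
                double m = a[i] / b[i-1];
                b[i] -= m * c[i-1];
                r[i] -= m * r[i-1];
            }
            psi[N-1] = r[N-1] / b[N-1];
            for (int i = N - 2; i >= 0; i--)     /* back substitution */
                psi[i] = (r[i] - c[i] * psi[i+1]) / b[i];

            /* second half-step and renormalization */
            double norm = 0.0;
            for (int i = 0; i < N; i++) {
                psi[i] *= exp(-0.5 * dt * (V[i] + g * psi[i] * psi[i]));
                norm += psi[i] * psi[i] * dx;
            }
            norm = sqrt(norm);
            for (int i = 0; i < N; i++) psi[i] /= norm;
        }

        double rms = 0.0;                        /* report rms size          */
        for (int i = 0; i < N; i++) rms += x[i] * x[i] * psi[i] * psi[i] * dx;
        printf("rms size = %g\n", sqrt(rms));
        return 0;
    }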

  8. The declarative/procedural model of lexicon and grammar.

    PubMed

    Ullman, M T

    2001-01-01

    Our use of language depends upon two capacities: a mental lexicon of memorized words and a mental grammar of rules that underlie the sequential and hierarchical composition of lexical forms into predictably structured larger words, phrases, and sentences. The declarative/procedural model posits that the lexicon/grammar distinction in language is tied to the distinction between two well-studied brain memory systems. On this view, the memorization and use of at least simple words (those with noncompositional, that is, arbitrary form-meaning pairings) depends upon an associative memory of distributed representations that is subserved by temporal-lobe circuits previously implicated in the learning and use of fact and event knowledge. This "declarative memory" system appears to be specialized for learning arbitrarily related information (i.e., for associative binding). In contrast, the acquisition and use of grammatical rules that underlie symbol manipulation is subserved by frontal/basal-ganglia circuits previously implicated in the implicit (nonconscious) learning and expression of motor and cognitive "skills" and "habits" (e.g., from simple motor acts to skilled game playing). This "procedural" system may be specialized for computing sequences. This novel view of lexicon and grammar offers an alternative to the two main competing theoretical frameworks. It shares the perspective of traditional dual-mechanism theories in positing that the mental lexicon and a symbol-manipulating mental grammar are subserved by distinct computational components that may be linked to distinct brain structures. However, it diverges from these theories where they assume components dedicated to each of the two language capacities (that is, domain-specific) and in their common assumption that lexical memory is a rote list of items. Conversely, while it shares with single-mechanism theories the perspective that the two capacities are subserved by domain-independent computational mechanisms, it diverges from them where they link both capacities to a single associative memory system with broad anatomic distribution. The declarative/procedural model, but neither traditional dual- nor single-mechanism models, predicts double dissociations between lexicon and grammar, with associations among associative memory properties, memorized words and facts, and temporal-lobe structures, and among symbol-manipulation properties, grammatical rule products, motor skills, and frontal/basal-ganglia structures. In order to contrast lexicon and grammar while holding other factors constant, we have focused our investigations of the declarative/procedural model on morphologically complex word forms. Morphological transformations that are (largely) unproductive (e.g., in go-went, solemn-solemnity) are hypothesized to depend upon declarative memory. These have been contrasted with morphological transformations that are fully productive (e.g., in walk-walked, happy-happiness), whose computation is posited to be solely dependent upon grammatical rules subserved by the procedural system. Here evidence is presented from studies that use a range of psycholinguistic and neurolinguistic approaches with children and adults. It is argued that converging evidence from these studies supports the declarative/procedural model of lexicon and grammar.

  9. Facilitating change in health-related behaviors and intentions: a randomized controlled trial of a multidimensional memory program for older adults.

    PubMed

    Wiegand, Melanie A; Troyer, Angela K; Gojmerac, Christina; Murphy, Kelly J

    2013-01-01

    Many older adults are concerned about memory changes with age and consequently seek ways to optimize their memory function. Memory programs are known to be variably effective in improving memory knowledge, other aspects of metamemory, and/or objective memory, but little is known about their impact on implementing and sustaining lifestyle and healthcare-seeking intentions and behaviors. We evaluated a multidimensional, evidence-based intervention, the Memory and Aging Program, that provides education about memory and memory change, training in the use of practical memory strategies, and support for implementation of healthy lifestyle behavior changes. In a randomized controlled trial, 42 healthy older adults participated in a program (n = 21) or a waitlist control (n = 21) group. Relative to the control group, participants in the program implemented more healthy lifestyle behaviors by the end of the program and maintained these changes 1 month later. Similarly, program participants reported a decreased intention to seek unnecessary medical attention for their memory immediately after the program and 1 month later. Findings support the use of multidimensional memory programs to promote healthy lifestyles and influence healthcare-seeking behaviors. Discussion focuses on implications of these changes for maximizing cognitive health and minimizing impact on healthcare resources.

  10. Building A Cloud Based Distributed Active Data Archive Center

    NASA Technical Reports Server (NTRS)

    Ramachandran, Rahul; Baynes, Katie; Murphy, Kevin

    2017-01-01

    NASA's Earth Science Data System (ESDS) Program facilitates the implementation of NASA's Earth Science strategic plan, which is committed to the full and open sharing of Earth science data obtained from NASA instruments with all users. The Earth Science Data and Information System (ESDIS) project manages the Earth Observing System Data and Information System (EOSDIS). Data within EOSDIS are held at Distributed Active Archive Centers (DAACs). One of the key responsibilities of the ESDS Program is to continuously evolve the entire data and information system to maximize the return on collected NASA data.

  11. Sparse distributed memory overview

    NASA Technical Reports Server (NTRS)

    Raugh, Mike

    1990-01-01

    The Sparse Distributed Memory (SDM) project is investigating the theory and applications of a massively parallel computing architecture, called sparse distributed memory, that will support the storage and retrieval of sensory and motor patterns characteristic of autonomous systems. The immediate objectives of the project are centered in studies of the memory itself and in the use of the memory to solve problems in speech, vision, and robotics. Investigation of methods for encoding sensory data is an important part of the research. Examples of NASA missions that may benefit from this work are Space Station, planetary rovers, and solar exploration. Sparse distributed memory offers promising technology for systems that must learn through experience and be capable of adapting to new circumstances, and for operating any large complex system requiring automatic monitoring and control. Sparse distributed memory is a massively parallel architecture motivated by efforts to understand how the human brain works. Sparse distributed memory is an associative memory, able to retrieve information from cues that only partially match patterns stored in the memory. It is able to store long temporal sequences derived from the behavior of a complex system, such as progressive records of the system's sensory data and correlated records of the system's motor controls.
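
    A small Kanerva-style sketch of the associative retrieval the overview describes, with illustrative parameters only (not the project's implementation): data are written into counters at all "hard locations" whose random addresses lie within a Hamming radius of the write address; reads sum those counters and threshold the result, so retrieval works even from a partially matching cue.

    /* Sparse distributed memory, toy 64-bit version (GCC/Clang builtin). */
    #include <stdio.h>
    #include <stdlib.h>

    #define BITS 64        /* address/word length      */
    #define LOCS 2000      /* number of hard locations */
    #define RADIUS 24      /* Hamming activation radius */

    static unsigned long long addr[LOCS];  /* random hard-location addresses */
    static int counters[LOCS][BITS];       /* per-bit up/down counters       */

    static int hamming(unsigned long long a, unsigned long long b) {
        return __builtin_popcountll(a ^ b);
    }

    static void sdm_write(unsigned long long key, unsigned long long data) {
        for (int l = 0; l < LOCS; l++)
            if (hamming(addr[l], key) <= RADIUS)
                for (int b = 0; b < BITS; b++)
                    counters[l][b] += (data >> b & 1) ? 1 : -1;
    }

    static unsigned long long sdm_read(unsigned long long key) {
        long sum[BITS] = {0};
        unsigned long long out = 0;
        for (int l = 0; l < LOCS; l++)
            if (hamming(addr[l], key) <= RADIUS)
                for (int b = 0; b < BITS; b++)
                    sum[b] += counters[l][b];
        for (int b = 0; b < BITS; b++)
            if (sum[b] > 0) out |= 1ULL << b;
        return out;
    }

    int main(void) {
        srand(42);
        for (int l = 0; l < LOCS; l++)     /* random hard locations */
            addr[l] = ((unsigned long long)rand() << 33) ^
                      ((unsigned long long)rand() << 10) ^
                      (unsigned long long)rand();

        unsigned long long key = 0x0123456789abcdefULL;
        unsigned long long pattern = 0xdeadbeefcafef00dULL;
        sdm_write(key, pattern);

        /* Retrieve with a noisy cue: flip three bits of the key. */
        unsigned long long cue = key ^ 0x0000100000400001ULL;
        printf("recovered %s\n",
               sdm_read(cue) == pattern ? "exactly" : "with errors");
        return 0;
    }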

  12. Methodology for fast detection of false sharing in threaded scientific codes

    DOEpatents

    Chung, I-Hsin; Cong, Guojing; Murata, Hiroki; Negishi, Yasushi; Wen, Hui-Fang

    2014-11-25

    A profiling tool identifies a code region with a false sharing potential. A static analysis tool classifies variables and arrays in the identified code region. A mapping detection library correlates memory access instructions in the identified code region with variables and arrays in the identified code region while a processor is running the identified code region. The mapping detection library identifies one or more instructions at risk, in the identified code region, which are subject to an analysis by a false sharing detection library. A false sharing detection library performs a run-time analysis of the one or more instructions at risk while the processor is re-running the identified code region. The false sharing detection library determines, based on the performed run-time analysis, whether two different portions of the cache memory line are accessed by the generated binary code.
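
    To make the phenomenon the patented methodology targets concrete (this sketch is not the patented tool): two threads update logically independent counters. When the counters sit in the same cache line, the line ping-pongs between cores; padding each counter to its own assumed 64-byte line removes the contention, which the timings show.

    /* False-sharing demonstration with pthreads. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L
    #define LINE 64                        /* assumed cache-line size */

    struct padded_ctr { volatile long v; char pad[LINE - sizeof(long)]; };
    static struct padded_ctr padded[2];    /* one line per counter     */
    static volatile long packed[2];        /* adjacent: share one line */

    static void *bump_packed(void *p) {
        long id = (long)p;
        for (long i = 0; i < ITERS; i++) packed[id]++;
        return NULL;
    }
    static void *bump_padded(void *p) {
        long id = (long)p;
        for (long i = 0; i < ITERS; i++) padded[id].v++;
        return NULL;
    }
    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }
    static void run(void *(*fn)(void *), const char *label) {
        pthread_t t[2];
        double t0 = now();
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, fn, (void *)id);
        for (long id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("%-28s %.2f s\n", label, now() - t0);
    }
    int main(void) {
        run(bump_packed, "packed (false sharing):");
        run(bump_padded, "padded (no false sharing):");
        return 0;
    }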

  13. Remote quantum entanglement between two micromechanical oscillators.

    PubMed

    Riedinger, Ralf; Wallucks, Andreas; Marinković, Igor; Löschnauer, Clemens; Aspelmeyer, Markus; Hong, Sungkun; Gröblacher, Simon

    2018-04-01

    Entanglement, an essential feature of quantum theory that allows for inseparable quantum correlations to be shared between distant parties, is a crucial resource for quantum networks [1]. Of particular importance is the ability to distribute entanglement between remote objects that can also serve as quantum memories. This has been previously realized using systems such as warm [2,3] and cold atomic vapours [4,5], individual atoms [6] and ions [7,8], and defects in solid-state systems [9-11]. Practical communication applications require a combination of several advantageous features, such as a particular operating wavelength, high bandwidth and long memory lifetimes. Here we introduce a purely micromachined solid-state platform in the form of chip-based optomechanical resonators made of nanostructured silicon beams. We create and demonstrate entanglement between two micromechanical oscillators across two chips that are separated by 20 centimetres. The entangled quantum state is distributed by an optical field at a designed wavelength near 1,550 nanometres. Therefore, our system can be directly incorporated in a realistic fibre-optic quantum network operating in the conventional optical telecommunication band. Our results are an important step towards the development of large-area quantum networks based on silicon photonics.

  14. Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search

    DTIC Science & Technology

    2013-01-03

    Report documentation fragment: [2] W. McLendon III, B. Hendrickson, S. J. Plimpton, and L. Rauchwerger, "Finding strongly connected components in distributed graphs," J... Performing organization: University of California at Berkeley, Electrical...

  15. States of mind: Emotions, body feelings, and thoughts share distributed neural networks

    PubMed Central

    Oosterwijk, Suzanne; Lindquist, Kristen A.; Anderson, Eric; Dautoff, Rebecca; Moriguchi, Yoshiya; Barrett, Lisa Feldman

    2012-01-01

    Scientists have traditionally assumed that different kinds of mental states (e.g., fear, disgust, love, memory, planning, concentration, etc.) correspond to different psychological faculties that have domain-specific correlates in the brain. Yet, growing evidence points to the constructionist hypothesis that mental states emerge from the combination of domain-general psychological processes that map to large-scale distributed brain networks. In this paper, we report a novel study testing a constructionist model of the mind in which participants generated three kinds of mental states (emotions, body feelings, or thoughts) while we measured activity within large-scale distributed brain networks using fMRI. We examined the similarity and differences in the pattern of network activity across these three classes of mental states. Consistent with a constructionist hypothesis, a combination of large-scale distributed networks contributed to emotions, thoughts, and body feelings, although these mental states differed in the relative contribution of those networks. Implications for a constructionist functional architecture of diverse mental states are discussed. PMID:22677148

  16. A parallel solver for huge dense linear systems

    NASA Astrophysics Data System (ADS)

    Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.

    2011-11-01

    HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) that helps scientists and engineers solve very large dense linear systems in parallel. The API uses parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies that leverage secondary memory in order to solve huge linear systems, of order 100,000 equations. The API is based on the parallel linear algebra library PLAPACK and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to users, hiding almost all the technical aspects related to the parallel execution of the code and the use of secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending on the dimension of the system, the number of processes, and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOPS with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summary: Program title: Huge Dense System Solver (HDSS). Catalogue identifier: AEHU_v1_1. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 87 062. No. of bytes in distributed program, including test data, etc.: 1 069 110. Distribution format: tar.gz. Programming language: Fortran 90, C. Computer: Parallel architectures: multiprocessors, computer clusters. Operating system: Linux/Unix. Has the code been vectorized or parallelized?: Yes, includes MPI primitives. RAM: Tested for up to 190 GB. Classification: 6.5. External routines: MPI (http://www.mpi-forum.org/), BLAS (http://www.netlib.org/blas/), PLAPACK (http://www.cs.utexas.edu/~plapack/), POOCLAPACK (ftp://ftp.cs.utexas.edu/pub/rvdg/PLAPACK/pooclapack.ps) (code for PLAPACK and POOCLAPACK is included in the distribution). Catalogue identifier of previous version: AEHU_v1_0. Journal reference of previous version: Comput. Phys. Comm. 182 (2011) 533. Does the new version supersede the previous version?: Yes. Nature of problem: Huge-scale dense systems of linear equations, Ax = B, beyond standard LAPACK capabilities. Solution method: The linear systems are solved by means of parallelized routines based on the LU factorization, using efficient secondary-storage algorithms when the available main memory is insufficient. Reasons for new version: In many applications we need to guarantee high accuracy in the solution of very large linear systems, and we can do so by using double-precision arithmetic. Summary of revisions: Version 1.1 can be used to solve linear systems using double-precision arithmetic; new version of the initialization routine; the user can choose the kind of arithmetic and the values of several parameters of the environment.
Running time: About 5 hours to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors using double-precision arithmetic on an eight-node commodity cluster with a total of 64 Intel cores.
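
    The out-of-core idea can be reduced to a hedged sketch: when a dense matrix exceeds main memory, stream it from disk in panels and process one panel at a time. Here a matrix-vector product y = A x is computed with only one row panel resident; HDSS applies the same principle, through POOCLAPACK, to a parallel LU factorization. The file layout and sizes are hypothetical.

    /* Out-of-core panel streaming, simplified to y = A x. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const long n = 2048;          /* matrix order                */
        const long panel = 256;       /* rows resident at once       */

        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        double *buf = malloc(panel * n * sizeof *buf);
        FILE *fa = fopen("A.bin", "wb+");
        if (!x || !y || !buf || !fa) return 1;

        /* Create a test matrix file A = 2*I (row-major) and x = ones. */
        for (long i = 0; i < n; i++) {
            x[i] = 1.0;
            for (long j = 0; j < n; j++) buf[j] = (i == j) ? 2.0 : 0.0;
            fwrite(buf, sizeof(double), (size_t)n, fa);
        }

        /* Out-of-core sweep: read one panel of rows, multiply, discard. */
        rewind(fa);
        for (long p = 0; p < n; p += panel) {
            if (fread(buf, sizeof(double), (size_t)(panel * n), fa)
                    != (size_t)(panel * n)) return 1;
            for (long i = 0; i < panel; i++) {
                double s = 0.0;
                for (long j = 0; j < n; j++) s += buf[i * n + j] * x[j];
                y[p + i] = s;
            }
        }
        printf("y[0] = %g, y[n-1] = %g (expect 2 and 2)\n", y[0], y[n-1]);

        fclose(fa);
        free(x); free(y); free(buf);
        return 0;
    }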

  17. MCdevelop - a universal framework for Stochastic Simulations

    NASA Astrophysics Data System (ADS)

    Slawinska, M.; Jadach, S.

    2011-03-01

    We present MCdevelop, a universal computer framework for developing and exploiting the wide class of Stochastic Simulations (SS) software. This powerful universal SS software development tool has been derived from a series of scientific projects for precision calculations in high energy physics (HEP), which feature a wide range of functionality in the SS software needed for advanced precision Quantum Field Theory calculations for the past LEP experiments and for the ongoing LHC experiments at CERN, Geneva. MCdevelop is a "spin-off" product of HEP, to be exploited in other areas, while it will still serve to develop new SS software for HEP experiments. Typically, SS involve independent generation of large sets of random "events", often requiring considerable CPU power. Since SS jobs usually do not share memory, they are easy to parallelize. The efficient development, testing, and parallel running of SS software require a convenient framework to develop the software source code, deploy and monitor batch jobs, and merge and analyse results from multiple parallel jobs, even before the production runs are terminated. Throughout the years of development of stochastic simulations for HEP, a sophisticated framework featuring all of the above functionality has been implemented. MCdevelop represents its latest version, written mostly in C++ (GNU compiler gcc). It uses Autotools to build binaries (optionally managed within the KDevelop 3.5.3 Integrated Development Environment (IDE)). It uses the open-source ROOT package for histogramming, graphics, and the persistency mechanism for C++ objects. MCdevelop helps to run multiple parallel jobs on any computer cluster with an NQS-type batch system. Program summary: Program title: MCdevelop. Catalogue identifier: AEHW_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEHW_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 48 136. No. of bytes in distributed program, including test data, etc.: 355 698. Distribution format: tar.gz. Programming language: ANSI C++. Computer: Any computer system or cluster with a C++ compiler and a UNIX-like operating system. Operating system: Most UNIX systems, Linux. The application programs were thoroughly tested under Ubuntu 7.04, 8.04 and CERN Scientific Linux 5. Has the code been vectorised or parallelised?: Tools (scripts) for optional parallelisation on a PC farm are included. RAM: 500 bytes. Classification: 11.3. External routines: ROOT package version 5.0 or higher (http://root.cern.ch/drupal/). Nature of problem: Developing any type of stochastic simulation program for high energy physics and other areas. Solution method: Object-oriented programming in C++ with an added persistency mechanism, batch scripts for running on PC farms, and Autotools.
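
    The parallelization pattern the abstract notes (independent jobs, results merged afterwards) can be sketched in a few lines of MPI C, shown here for a Monte Carlo estimate of pi. This is only an illustration of the pattern; MCdevelop's own tooling is script- and ROOT-based, and the srand-per-rank seeding used here is a simplification, not a rigorous independent random stream.

    /* Independent event generation per rank, merged with MPI_Reduce. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(12345 + rank);               /* one seed per job (toy)      */
        const long n = 1000000;
        long hits = 0;
        for (long i = 0; i < n; i++) {     /* generate "events" locally   */
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;
        }

        long total = 0;                    /* merge results from all jobs */
        MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~ %.6f from %ld events\n",
                   4.0 * total / (double)(n * (long)size), n * (long)size);

        MPI_Finalize();
        return 0;
    }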

  18. High-Efficiency High-Resolution Global Model Developments at the NASA Goddard Data Assimilation Office

    NASA Technical Reports Server (NTRS)

    Lin, Shian-Jiann; Atlas, Robert (Technical Monitor)

    2002-01-01

    The Data Assimilation Office (DAO) has been developing a new generation of ultra-high resolution General Circulation Model (GCM) that is suitable for 4-D data assimilation, numerical weather predictions, and climate simulations. These three applications have conflicting requirements. For 4-D data assimilation and weather predictions, it is highly desirable to run the model at the highest possible spatial resolution (e.g., 55 km or finer) so as to be able to resolve and predict socially and economically important weather phenomena such as tropical cyclones, hurricanes, and severe winter storms. For climate change applications, the model simulations need to be carried out for decades, if not centuries. To reduce uncertainty in climate change assessments, the next generation model would also need to be run at a fine enough spatial resolution that can at least marginally simulate the effects of intense tropical cyclones. Scientific problems (e.g., parameterization of subgrid scale moist processes) aside, all three areas of application require the model's computational performance to be dramatically improved as compared to the previous generation. In this talk, I will present the current and future developments of the "finite-volume dynamical core" at the Data Assimilation Office. This dynamical core applies modern monotonicity-preserving algorithms and is genuinely conservative by construction, not by an ad hoc fixer. The "discretization" of the conservation laws is purely local, which is clearly advantageous for resolving sharp gradient flow features. In addition, the local nature of the finite-volume discretization also has a significant advantage on distributed memory parallel computers. Together with a unique vertically Lagrangian control volume discretization that essentially reduces the dimension of the computational problem from three to two, the finite-volume dynamical core is very efficient, particularly at high resolutions. I will also present the computational design of the dynamical core using a hybrid distributed-shared memory programming paradigm that is portable to virtually any of today's high-end parallel super-computing clusters.
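
    A tiny hedged sketch of the hybrid distributed-shared memory paradigm the talk describes: MPI ranks divide the domain across nodes while OpenMP threads share memory within each node. The one-dimensional stencil update here is purely illustrative; it is not the finite-volume dynamical core.

    /* Hybrid MPI + OpenMP: halo exchange, then threaded local update. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define NLOC 1024          /* grid points owned by each MPI rank */

    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static double u[NLOC + 2], unew[NLOC + 2];   /* with halo cells */
        for (int i = 1; i <= NLOC; i++) u[i] = rank; /* dummy data      */

        /* Distributed-memory part: exchange halo cells with neighbours. */
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Shared-memory part: threads update local points in parallel. */
        #pragma omp parallel for
        for (int i = 1; i <= NLOC; i++)
            unew[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];

        double s = 0.0;
        for (int i = 1; i <= NLOC; i++) s += unew[i];
        if (rank == 0)
            printf("ranks %d, threads per rank %d, local checksum %g\n",
                   size, omp_get_max_threads(), s);
        MPI_Finalize();
        return 0;
    }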

  19. Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units.

    PubMed

    Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley

    2011-05-01

    Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
