Sample records for scalable shared memory

  1. A Massively Parallel Code for Polarization Calculations

    NASA Astrophysics Data System (ADS)

    Akiyama, Shizuka; Höflich, Peter

    2001-03-01

    We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding NLTE atmospheres on massively parallel computers that utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version based on the shared memory model alone. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed a scalability improvement of about 40%.

  2. Combining Distributed and Shared Memory Models: Approach and Evolution of the Global Arrays Toolkit

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nieplocha, Jarek; Harrison, Robert J.; Kumar, Mukul

    2002-07-29

    Both shared memory and distributed memory models have advantages and shortcomings. The shared memory model is much easier to use, but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers, this characteristic can have a negative impact on performance and scalability. Various techniques, such as code restructuring to increase data reuse and introducing blocking in data accesses, can address the problem and yield performance competitive with message passing [Singh], though at the cost of compromising ease of use. Distributed memory models such as message passing or one-sided communication offer performance and scalability, but they compromise ease of use. In this context, the message-passing model is sometimes referred to as "assembly programming for scientific computing". The Global Arrays toolkit [GA1, GA2] attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed explicitly by the programmer. This management is achieved by explicit calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data, and it allows data locality to be explicitly specified and hence managed. The GA model exposes to the programmer the hierarchical memory of modern high-performance computer systems and, by recognizing the communication overhead for remote data transfer, promotes data reuse and locality of reference. This paper describes the characteristics of the Global Arrays programming model and the capabilities of the toolkit, and discusses its evolution.
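
    The get/compute/put discipline the abstract describes can be sketched with the Global Arrays C interface. This is a minimal illustration, not the toolkit's canonical example: the call names (GA_Initialize, NGA_Create, NGA_Get, NGA_Put) follow the documented C API, but exact signatures and any required memory-allocator setup should be checked against your GA release.

      /* Sketch: explicit locality management in the Global Arrays style.
       * Assumes the GA C interface (ga.h/macdecls.h); some builds also
       * require MA_init() before NGA_Create -- see the GA documentation. */
      #include <mpi.h>
      #include "ga.h"
      #include "macdecls.h"

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          GA_Initialize();

          int dims[2] = {1000, 1000};    /* logically shared 2-D array */
          int chunk[2] = {-1, -1};       /* let GA choose the distribution */
          int g_a = NGA_Create(C_DBL, 2, dims, "matrix", chunk);

          /* Every process works on the same patch here purely for brevity;
           * a real code would derive its own lo/hi from the distribution. */
          int lo[2] = {0, 0}, hi[2] = {99, 99}, ld[1] = {100};
          static double buf[100 * 100];

          NGA_Get(g_a, lo, hi, buf, ld);      /* global -> local (possibly remote) */
          for (int i = 0; i < 100 * 100; i++)
              buf[i] *= 2.0;                  /* compute on local storage */
          NGA_Put(g_a, lo, hi, buf, ld);      /* local -> global */
          GA_Sync();                          /* collective completion point */

          GA_Destroy(g_a);
          GA_Terminate();
          MPI_Finalize();
          return 0;
      }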

  3. Runtime support for parallelizing data mining algorithms

    NASA Astrophysics Data System (ADS)

    Jin, Ruoming; Agrawal, Gagan

    2002-03-01

    With recent technological advances, shared memory parallel machines have become more scalable and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed, starting from a common specification of the algorithm.
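
    Of the techniques listed, full replication is the simplest to picture: each thread updates a private copy of the reduction object during the scan, and the copies are merged once at the end, so the hot loop needs no locks. The sketch below illustrates the idea on a plain histogram in OpenMP C; it is not the authors' reduction-object interface, and the bin count is invented for the example.

      /* Full replication: one private copy of the reduction object per
       * thread, merged after the parallel scan. Illustrative only;
       * counts[] is assumed zero-initialized by the caller. */
      #include <omp.h>
      #include <stdlib.h>

      #define NBINS 256   /* hypothetical reduction-object size */

      void histogram(const unsigned char *data, long n, long counts[NBINS]) {
          int nthreads = omp_get_max_threads();
          long *priv = calloc((size_t)nthreads * NBINS, sizeof *priv);

          #pragma omp parallel
          {
              long *mine = priv + (size_t)omp_get_thread_num() * NBINS;
              #pragma omp for
              for (long i = 0; i < n; i++)
                  mine[data[i]]++;            /* no locks in the hot loop */
          }
          for (int t = 0; t < nthreads; t++)  /* sequential merge phase */
              for (int b = 0; b < NBINS; b++)
                  counts[b] += priv[(size_t)t * NBINS + b];
          free(priv);
      }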

  4. Performance Modeling and Measurement of Parallelized Code for Distributed Shared Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry

    1998-01-01

    This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With the increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of the NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non-Uniform Memory Access (ccNUMA) architecture. We report measurement-based performance of these parallelized benchmarks from four perspectives: efficacy of the parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized versions of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives, but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
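
    The flavor of directive-based parallelization evaluated here is easy to picture. The paper used native SGI Fortran77 directives; the OpenMP C analogue below is a sketch of the same idea, in which a single directive parallelizes an otherwise sequential loop and the compiler and runtime do the rest.

      #include <omp.h>

      /* One directive turns the sequential loop into a parallel one. On a
       * ccNUMA system such as the Origin2000, first-touch page placement of
       * x and y largely determines the data-locality overhead the paper
       * identifies as the main obstacle to predicted gains. */
      void daxpy(long n, double a, const double *x, double *y) {
          #pragma omp parallel for schedule(static)
          for (long i = 0; i < n; i++)
              y[i] += a * x[i];
      }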

  5. Parallel performance investigations of an unstructured mesh Navier-Stokes solver

    NASA Technical Reports Server (NTRS)

    Mavriplis, Dimitri J.

    2000-01-01

    A Reynolds-averaged Navier-Stokes solver based on unstructured mesh techniques for analysis of high-lift configurations is described. The method makes use of an agglomeration multigrid solver for convergence acceleration. Implicit line-smoothing is employed to relieve the stiffness associated with highly stretched meshes. A GMRES technique is also implemented to speed convergence at the expense of additional memory usage. The solver is cache efficient and fully vectorizable, and is parallelized using a two-level hybrid MPI-OpenMP implementation suitable for shared and/or distributed memory architectures, as well as clusters of shared memory machines. Convergence and scalability results are illustrated for various high-lift cases.
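
    A minimal sketch of the two-level hybrid structure described above: MPI ranks communicate across (clusters of) nodes while OpenMP threads share memory within a node. The loop body is a stand-in for the solver's per-node mesh work, not the actual code.

      #include <mpi.h>
      #include <omp.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int provided, rank;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          double local = 0.0;
          #pragma omp parallel for reduction(+:local)  /* intra-node threads */
          for (int i = 0; i < 1000000; i++)
              local += 1.0 / (1.0 + i);                /* stand-in for mesh work */

          double global;                               /* inter-node messages */
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          if (rank == 0) printf("sum = %f\n", global);
          MPI_Finalize();
          return 0;
      }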

  6. DISP: Optimizations towards Scalable MPI Startup

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fu, Huansong; Pophale, Swaroop S; Gorentla Venkata, Manjunath

    2016-01-01

    Despite the popularity of MPI for high performance computing, the startup of MPI programs faces a scalability challenge, as both the execution time and memory consumption increase drastically at scale. We have examined this problem using the collective modules of Cheetah and Tuned in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware off-load. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques, namely Delayed Initialization, Module Sharing and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module. At the same time, it helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on the Titan supercomputer at ORNL with up to 4096 processes. The results show that our delayed initialization can speed up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively, our module sharing can reduce the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively, and our prediction-based topology setup can speed up the startup of Cheetah by up to 80%.
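
    The delayed-initialization idea generalizes to a familiar lazy-initialization pattern: defer a module's expensive setup from startup to its first use, so communicators that never exercise a module never pay for it. The sketch below is a generic C illustration with hypothetical names, not Open MPI's internal interface.

      #include <stdbool.h>

      typedef struct {
          bool ready;
          /* ... topology tables, scratch buffers, etc. ... */
      } coll_module_t;   /* hypothetical stand-in for a collective module */

      static void coll_module_setup(coll_module_t *m) {
          /* expensive allocation / topology discovery happens here */
          m->ready = true;
      }

      /* Called on the first collective, not at MPI_Init time. A real MPI
       * runtime would guard this against concurrent first calls. */
      void coll_bcast(coll_module_t *m, void *buf, int count) {
          if (!m->ready)
              coll_module_setup(m);   /* deferred from startup to first use */
          /* ... perform the broadcast using the module's tables ... */
          (void)buf; (void)count;
      }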

  7. Ultrahigh-order Maxwell solver with extreme scalability for electromagnetic PIC simulations of plasmas

    NASA Astrophysics Data System (ADS)

    Vincenti, Henri; Vay, Jean-Luc

    2018-07-01

    The advent of massively parallel supercomputers, with their distributed-memory technology using many processing units, has favored the development of highly-scalable local low-order solvers at the expense of harder-to-scale global very high-order spectral methods. Indeed, FFT-based methods, which were very popular on shared memory computers, have been largely replaced by finite-difference (FD) methods for the solution of many problems, including plasma simulations with electromagnetic Particle-In-Cell methods. For some problems, such as the modeling of so-called "plasma mirrors" for the generation of high-energy particles and ultra-short radiation, we have shown that the inaccuracies of standard FD-based PIC methods prevent modeling at sufficient accuracy on present supercomputers. We demonstrate here that a new method, based on the use of local FFTs, enables ultrahigh-order accuracy with unprecedented scalability, and thus for the first time the accurate modeling of plasma mirrors in 3D.

  8. Hybrid-optimization strategy for the communication of large-scale Kinetic Monte Carlo simulation

    NASA Astrophysics Data System (ADS)

    Wu, Baodong; Li, Shigang; Zhang, Yunquan; Nie, Ningming

    2017-02-01

    The parallel Kinetic Monte Carlo (KMC) algorithm based on domain decomposition has been widely used in large-scale physical simulations. However, the communication overhead of the parallel KMC algorithm is critical and severely degrades the overall performance and scalability. In this paper, we present a hybrid optimization strategy to reduce the communication overhead of parallel KMC simulations. We first propose a communication aggregation algorithm to reduce the total number of messages and eliminate communication redundancy, as sketched below. Then, we utilize shared memory to reduce the memory copy overhead of intra-node communication. Finally, we optimize the communication scheduling using neighborhood collective operations. We demonstrate the scalability and high performance of our hybrid optimization strategy by both theoretical and experimental analysis. Results show that the optimized KMC algorithm exhibits better performance and scalability than the well-known open-source library SPPARKS. On a 32-node Xeon E5-2680 cluster (640 cores in total), the optimized algorithm reduces the communication time by 24.8% compared with SPPARKS.
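
    Communication aggregation, the first of the three techniques, can be pictured as batching: instead of sending one small message per boundary event, events bound for the same neighbor are packed into a buffer and shipped in a single MPI message. The sketch below is an illustration with invented names and a fixed-size buffer, not the paper's implementation.

      #include <mpi.h>

      #define MAX_EVENTS 4096

      typedef struct { int site; double rate; } kmc_event_t;  /* illustrative */

      typedef struct {
          kmc_event_t buf[MAX_EVENTS];
          int count;
      } agg_buffer_t;

      /* Accumulate instead of sending immediately; a real code would also
       * flush automatically when the buffer fills. */
      void agg_add(agg_buffer_t *a, kmc_event_t ev) {
          a->buf[a->count++] = ev;
      }

      /* One message per neighbor per flush, instead of one per event. */
      void agg_flush(agg_buffer_t *a, int neighbor, MPI_Comm comm) {
          if (a->count == 0) return;
          MPI_Send(a->buf, a->count * (int)sizeof(kmc_event_t), MPI_BYTE,
                   neighbor, 0, comm);
          a->count = 0;
      }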

  9. Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Buntinas, D.; Mercier, G.; Gropp, W.

    2005-12-02

    This paper presents a new low-level communication subsystem called Nemesis. Nemesis has been designed and implemented to be scalable and efficient both for intranode communication using shared memory and for internode communication using high-performance networks, and it is natively multimethod-enabled. Nemesis has been integrated into MPICH2 as a CH3 channel and delivers better performance than other dedicated communication channels in MPICH2. Furthermore, the resulting MPICH2 architecture outperforms other MPI implementations in point-to-point benchmarks.

  10. Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2002-01-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique for solving sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0)-preconditioned CG (PCG) using different programming paradigms and architectures. Results show that, for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems; cache reuse may be more important than reducing communication; it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution; and a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gain. An implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread-level parallelism.
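
    For reference, one iteration of PCG with preconditioner M (here ILU(0)) consists of the standard recurrences below; the ordering and partitioning strategies studied in the paper change the sparsity pattern of A and M, and hence the memory-access and communication behavior, but not these formulas. Starting from r_0 = b - A x_0, z_0 = M^{-1} r_0, p_0 = z_0:

      \begin{align*}
      \alpha_k &= \frac{r_k^{T} z_k}{p_k^{T} A p_k}, &
      x_{k+1} &= x_k + \alpha_k p_k, &
      r_{k+1} &= r_k - \alpha_k A p_k, \\
      z_{k+1} &= M^{-1} r_{k+1}, &
      \beta_k &= \frac{r_{k+1}^{T} z_{k+1}}{r_k^{T} z_k}, &
      p_{k+1} &= z_{k+1} + \beta_k p_k.
      \end{align*}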

  11. Scalability of a Low-Cost Multi-Teraflop Linux Cluster for High-End Classical Atomistic and Quantum Mechanical Simulations

    NASA Technical Reports Server (NTRS)

    Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash

    2003-01-01

    Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: Classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and spacefilling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.

  12. Parallel k-means++ for Multiple Shared-Memory Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mackey, Patrick S.; Lewis, Robert R.

    2016-09-22

    In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.
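
    For context, the exact k-means++ seeding being parallelized draws each new center with probability proportional to a point's squared distance from its nearest existing center. A serial 1-D sketch (illustrative only; the distance pass is what the paper parallelizes across threads):

      #include <stdlib.h>
      #include <float.h>

      /* Serial k-means++ seeding, 1-D for brevity. */
      void kmeanspp_init(const double *x, int n, int k, double *centers) {
          double *d2 = malloc(n * sizeof *d2);
          centers[0] = x[rand() % n];              /* first center: uniform */
          for (int c = 1; c < k; c++) {
              double total = 0.0;
              for (int i = 0; i < n; i++) {        /* distance to nearest center */
                  double best = DBL_MAX;
                  for (int j = 0; j < c; j++) {
                      double d = x[i] - centers[j];
                      if (d * d < best) best = d * d;
                  }
                  d2[i] = best;
                  total += best;
              }
              /* weighted draw proportional to squared distance */
              double r = total * ((double)rand() / RAND_MAX), acc = 0.0;
              int pick = n - 1;
              for (int i = 0; i < n; i++) {
                  acc += d2[i];
                  if (acc >= r) { pick = i; break; }
              }
              centers[c] = x[pick];
          }
          free(d2);
      }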

  13. Shared virtual memory and generalized speedup

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He; Zhu, Jianping

    1994-01-01

    Generalized speedup is defined as parallel speed over sequential speed. The generalized speedup and its relation with other existing performance metrics, such as traditional speedup, efficiency, scalability, etc., are carefully studied. In terms of the introduced asymptotic speed, it was shown that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue in shared virtual memory machines. A scientific application was implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, various causes of superlinear speedup are also presented.
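
    In symbols, writing speed as work over time, the definition quoted above and its traditional counterpart are:

      \text{speed} = \frac{W}{T}, \qquad
      S_{\mathrm{gen}}(p) = \frac{W_p / T_p}{W_1 / T_1}, \qquad
      S_{\mathrm{trad}}(p) = \frac{T_1}{T_p},

    so the two coincide exactly when the parallel and sequential runs perform the same work (W_p = W_1) and diverge for scaled problem sizes, which is where the paper's asymptotic-speed definition of uniprocessor efficiency matters.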

  14. VLBI-resolution radio-map algorithms: Performance analysis of different levels of data-sharing on multi-socket, multi-core architectures

    NASA Astrophysics Data System (ADS)

    Tabik, S.; Romero, L. F.; Mimica, P.; Plata, O.; Zapata, E. L.

    2012-09-01

    A broad area in astronomy focuses on simulating extragalactic objects based on Very Long Baseline Interferometry (VLBI) radio-maps. Several algorithms in this scope simulate the radio-maps that would be observed if emitted from a predefined extragalactic object. This work analyzes the performance and scaling of this kind of algorithm on multi-socket, multi-core architectures. In particular, we evaluate a sharing approach, a privatizing approach and a hybrid approach on systems with a complex memory hierarchy that includes a shared Last Level Cache (LLC). In addition, we investigate which manual processes can be systematized and then automated in future work. The experiments show that the data-privatizing model scales efficiently on medium-scale multi-socket, multi-core systems (up to 48 cores), while, regardless of algorithmic and scheduling optimizations, the sharing approach is unable to reach acceptable scalability on more than one socket. However, the hybrid model with a specific level of data-sharing provides the best scalability over all of the multi-socket, multi-core systems used.

  15. A simple modern correctness condition for a space-based high-performance multiprocessor

    NASA Technical Reports Server (NTRS)

    Probst, David K.; Li, Hon F.

    1992-01-01

    A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see the design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.

  16. Static Memory Deduplication for Performance Optimization in Cloud Computing.

    PubMed

    Jia, Gangyong; Han, Guangjie; Wang, Hao; Yang, Xuan

    2017-04-27

    In a cloud computing environment, the number of virtual machines (VMs) on a single physical server and the number of applications running on each VM are continuously growing. This has led to an enormous increase in the demand of memory capacity and subsequent increase in the energy consumption in the cloud. Lack of enough memory has become a major bottleneck for scalability and performance of virtualization interfaces in cloud computing. To address this problem, memory deduplication techniques which reduce memory demand through page sharing are being adopted. However, such techniques suffer from overheads in terms of number of online comparisons required for the memory deduplication. In this paper, we propose a static memory deduplication (SMD) technique which can reduce memory capacity requirement and provide performance optimization in cloud computing. The main innovation of SMD is that the process of page detection is performed offline, thus potentially reducing the performance cost, especially in terms of response time. In SMD, page comparisons are restricted to the code segment, which has the highest shared content. Our experimental results show that SMD efficiently reduces memory capacity requirement and improves performance. We demonstrate that, compared to other approaches, the cost in terms of the response time is negligible.

  17. Static Memory Deduplication for Performance Optimization in Cloud Computing

    PubMed Central

    Jia, Gangyong; Han, Guangjie; Wang, Hao; Yang, Xuan

    2017-01-01

    In a cloud computing environment, the number of virtual machines (VMs) on a single physical server and the number of applications running on each VM are continuously growing. This has led to an enormous increase in the demand of memory capacity and subsequent increase in the energy consumption in the cloud. Lack of enough memory has become a major bottleneck for scalability and performance of virtualization interfaces in cloud computing. To address this problem, memory deduplication techniques which reduce memory demand through page sharing are being adopted. However, such techniques suffer from overheads in terms of number of online comparisons required for the memory deduplication. In this paper, we propose a static memory deduplication (SMD) technique which can reduce memory capacity requirement and provide performance optimization in cloud computing. The main innovation of SMD is that the process of page detection is performed offline, thus potentially reducing the performance cost, especially in terms of response time. In SMD, page comparisons are restricted to the code segment, which has the highest shared content. Our experimental results show that SMD efficiently reduces memory capacity requirement and improves performance. We demonstrate that, compared to other approaches, the cost in terms of the response time is negligible. PMID:28448434

  18. Solutions and debugging for data consistency in multiprocessors with noncoherent caches

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bernstein, D.; Mendelson, B.; Breternitz, M. Jr.

    1995-02-01

    We analyze two important problems that arise in shared-memory multiprocessor systems. The stale data problem involves ensuring that data items in the local memory of individual processors are current, independent of writes done by other processors. False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware-controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on software methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for the IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.
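
    False sharing is easy to reproduce and to fix by hand, which is what the proposed debugging environment helps automate. In the C sketch below, two threads update different counters; if the counters were adjacent longs they would share a consistency block and every update would invalidate the other processor's copy, so each counter is padded to its own block (the 64-byte size is an assumption, not the POWER/4 parameter).

      #include <omp.h>

      #define BLOCK 64   /* assumed consistency-block size */

      struct padded { long v; char pad[BLOCK - sizeof(long)]; };

      /* Two threads count even entries over interleaved halves; padding
       * keeps the two counters in separate blocks, eliminating false
       * sharing. Sketch assumes the num_threads(2) request is honored. */
      long count_even(const int *a, long n) {
          struct padded c[2] = {{0}, {0}};
          #pragma omp parallel num_threads(2)
          {
              int t = omp_get_thread_num();
              for (long i = t; i < n; i += 2)
                  c[t].v += (a[i] % 2 == 0);
          }
          return c[0].v + c[1].v;
      }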

  19. A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

    DOE PAGES

    Song, Fengguang; Dongarra, Jack

    2014-10-01

    Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.

  20. A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Song, Fengguang; Dongarra, Jack

    Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.

  1. Importance of balanced architectures in the design of high-performance imaging systems

    NASA Astrophysics Data System (ADS)

    Sgro, Joseph A.; Stanton, Paul C.

    1999-03-01

    Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high-performance computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high-performance general-purpose microprocessor technology have produced an abundance of low-cost components suitable for use in high-performance computing systems. A common pitfall in the design of high-performance imaging systems, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The characteristic symptom of this problem is the failure of system performance to scale as more processors are added. The problem is exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high-performance external memory interfaces makes it practical to design high-performance imaging systems with balanced computational and memory bandwidth. Real-world examples of such designs are presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.

  2. Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Probst, David K.

    1993-01-01

    A scalable solution to the memory-latency problem is necessary to prevent the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from degrading performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include: processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include: vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.
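
    Prefetching, one of the tolerance techniques listed above, is easy to show in miniature: request a cache line ahead of use so the fetch overlaps computation on the current element. The sketch uses the GCC/Clang builtin __builtin_prefetch; the prefetch distance of 8 is a tuning assumption, and prefetches past the end of the array are non-faulting hints.

      /* Overlap memory fetch with computation via software prefetch. */
      void scale(double *x, long n, double a) {
          for (long i = 0; i < n; i++) {
              __builtin_prefetch(&x[i + 8], 1, 0);  /* fetch ahead of use */
              x[i] *= a;                            /* compute while it arrives */
          }
      }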

  3. Performance prediction: A case study using a multi-ring KSR-1 machine

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He; Zhu, Jianping

    1995-01-01

    While computers with tens of thousands of processors have successfully delivered high performance for solving some of the so-called 'grand-challenge' applications, the notion of scalability is becoming an important metric in the evaluation of parallel machine architectures and algorithms. In this study, the prediction of scalability and its application are carefully investigated. A simple formula is presented to show the relation between scalability, single processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR-1 shared virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, hierarchical architecture can be predicted, and the best algorithm-machine combination can be selected for a given application.

  4. Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chin, George; Marquez, Andres; Choudhury, Sutanay

    2012-09-01

    Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code's data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and an AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.
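
    To give a flavor of the kernel being parallelized, the sketch below tallies one census entry, the closed undirected triangle, over a dense adjacency matrix with OpenMP. A full census distinguishes all 16 directed triad types in one pass; dynamic scheduling is used here because per-row work is skewed, the same load-balance concern the paper works through.

      #include <omp.h>

      /* Count undirected triangles (one triad-census entry). adj is an
       * n x n 0/1 adjacency matrix; the full directed census is omitted. */
      long count_triangles(const unsigned char *adj, int n) {
          long total = 0;
          #pragma omp parallel for reduction(+:total) schedule(dynamic)
          for (int i = 0; i < n; i++)
              for (int j = i + 1; j < n; j++) {
                  if (!adj[(long)i * n + j]) continue;
                  for (int k = j + 1; k < n; k++)
                      if (adj[(long)i * n + k] && adj[(long)j * n + k])
                          total++;   /* edges i-j, i-k, j-k all present */
              }
          return total;
      }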

  5. Towards Scalable 1024 Processor Shared Memory Systems

    NASA Technical Reports Server (NTRS)

    Ciotti, Robert B.; Thigpen, William W. (Technical Monitor)

    2001-01-01

    Over the past 3 years, NASA Ames has been involved in a cooperative effort with SGI to develop the largest single-system-image systems available. Currently a 1024-processor Origin3000 is under development, with first boot expected later in the summer of 2001. This paper discusses some early results with a 512-processor Origin3000 system and some arcane IRIX system calls that can dramatically improve scaling performance.

  6. Parallel Navier-Stokes computations on shared and distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar

    1995-01-01

    We study a high order finite difference scheme to solve the time accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SP1). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.

  7. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Janjusic, Tommy; Kartsaklis, Christos

    Memory scalability is an enduring problem and bottleneck that plagues many parallel codes. Parallel codes designed for High Performance Systems are typically designed over the span of several, and in some instances 10+, years. As a result, optimization practices which were appropriate for earlier systems may no longer be valid and thus require careful optimization consideration. Specifically, parallel codes whose memory footprint is a function of their scalability must be carefully considered for future exascale systems. In this paper we present a methodology and tool to study the memory scalability of parallel codes. Using our methodology we evaluate an application's memory footprint as a function of scalability, which we coined memory efficiency, and describe our results. In particular, using our in-house tools we can pinpoint the specific application components which contribute to the application's overall memory footprint (application data structures, libraries, etc.).

  8. Generation of multiphoton entangled quantum states by means of integrated frequency combs.

    PubMed

    Reimer, Christian; Kues, Michael; Roztocki, Piotr; Wetzel, Benjamin; Grazioso, Fabio; Little, Brent E; Chu, Sai T; Johnston, Tudor; Bromberg, Yaron; Caspani, Lucia; Moss, David J; Morandotti, Roberto

    2016-03-11

    Complex optical photon states with entanglement shared among several modes are critical to improving our fundamental understanding of quantum mechanics and have applications for quantum information processing, imaging, and microscopy. We demonstrate that optical integrated Kerr frequency combs can be used to generate several bi- and multiphoton entangled qubits, with direct applications for quantum communication and computation. Our method is compatible with contemporary fiber and quantum memory infrastructures and with chip-scale semiconductor technology, enabling compact, low-cost, and scalable implementations. The exploitation of integrated Kerr frequency combs, with their ability to generate multiple, customizable, and complex quantum states, can provide a scalable, practical, and compact platform for quantum technologies. Copyright © 2016, American Association for the Advancement of Science.

  9. Scalable Motion Estimation Processor Core for Multimedia System-on-Chip Applications

    NASA Astrophysics Data System (ADS)

    Lai, Yeong-Kang; Hsieh, Tian-En; Chen, Lien-Fei

    2007-04-01

    In this paper, we describe a high-throughput and scalable motion estimation processor architecture for multimedia system-on-chip applications. The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. By using the PE rings efficiently and an intelligent memory-interleaving organization, the efficiency of the architecture can be increased. Moreover, using efficient on-chip memories and a data management technique can effectively decrease the power consumption and memory bandwidth. Techniques for reducing the number of interconnections and external memory accesses are also presented. Our results demonstrate that the proposed scalable PE-ringed architecture is a flexible and high-performance processor core for multimedia system-on-chip applications.

  10. The Automatic Parallelisation of Scientific Application Codes Using a Computer Aided Parallelisation Toolkit

    NASA Technical Reports Server (NTRS)

    Ierotheou, C.; Johnson, S.; Leggett, P.; Cross, M.; Evans, E.; Jin, Hao-Qiang; Frumkin, M.; Yan, J.; Biegel, Bryan (Technical Monitor)

    2001-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. Historically, the lack of a programming standard for using directives and the rather limited performance due to poor scalability have affected the take-up of this programming model approach. Significant progress has been made in hardware and software technologies; as a result, the performance of parallel programs with compiler directives has also improved. The introduction of an industrial standard for shared-memory programming with directives, OpenMP, has also addressed the issue of portability. In this study, we have extended the computer aided parallelization toolkit (developed at the University of Greenwich) to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline the way in which loop types are categorized and how efficient OpenMP directives can be defined and placed using the in-depth interprocedural analysis that is carried out by the toolkit. We also discuss the application of the toolkit to the NAS Parallel Benchmarks and a number of real-world application codes. This work not only demonstrates the great potential of using the toolkit to quickly parallelize serial programs but also the good performance achievable on up to 300 processors for hybrid message passing and directive-based parallelizations.

  11. Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel W.

    Coupled-cluster methods provide highly accurate models of molecular structure by explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy efficient manner. We achieve up to 240× speedup compared with the best optimized shared memory implementation. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  12. 3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

    NASA Astrophysics Data System (ADS)

    Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

    2017-03-01

    In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented on a state-of-the-art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand for compute time, memory, storage and I/O, along with the need for their effective management. The most resource-intensive modules of the algorithm are traveltime calculations and migration summation, which exhibit an inherent trade-off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for multicore CPU based parallel systems had been developed. Recently, we have worked on improving the parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable of efficiently migrating both prestack and poststack 3D data. It exhibits flexibility for migrating a large number of traces within the available node memory and with minimal requirements of storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments, and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.

  13. Scalability Issues for Remote Sensing Infrastructure: A Case Study.

    PubMed

    Liu, Yang; Picard, Sean; Williamson, Carey

    2017-04-29

    For the past decade, a team of University of Calgary researchers has operated a large "sensor Web" to collect, analyze, and share scientific data from remote measurement instruments across northern Canada. This sensor Web receives real-time data streams from over a thousand Internet-connected sensors, with a particular emphasis on environmental data (e.g., space weather, auroral phenomena, atmospheric imaging). Through research collaborations, we had the opportunity to evaluate the performance and scalability of their remote sensing infrastructure. This article reports the lessons learned from our study, which considered both data collection and data dissemination aspects of their system. On the data collection front, we used benchmarking techniques to identify and fix a performance bottleneck in the system's memory management for TCP data streams, while also improving system efficiency on multi-core architectures. On the data dissemination front, we used passive and active network traffic measurements to identify and reduce excessive network traffic from the Web robots and JavaScript techniques used for data sharing. While our results are from one specific sensor Web system, the lessons learned may apply to other scientific Web sites with remote sensing infrastructure.

  14. High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Peterka, Tom; Morozov, Dmitriy; Phillips, Carolyn

    2014-11-14

    Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization; but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared-memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.

  15. An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations

    NASA Technical Reports Server (NTRS)

    Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

    1996-01-01

    We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.

  16. Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms

    NASA Technical Reports Server (NTRS)

    Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

    1997-01-01

    We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies (the IBM SP and the Cray T3D). We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.

  17. Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel

    Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  18. Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    DOE PAGES

    Ibrahim, Khaled Z.; Epifanovsky, Evgeny; Williams, Samuel; ...

    2017-03-08

    Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.

  19. Scalable Parallel Density-based Clustering and Applications

    NASA Astrophysics Data System (ADS)

    Patwary, Mostofa Ali

    2014-04-01

    Recently, density-based clustering algorithms (DBSCAN and OPTICS) have received significant attention from the scientific community due to their unique capability of discovering arbitrarily shaped clusters and eliminating noise data. These algorithms have several applications that require high performance computing, including finding halos and subhalos (clusters) in massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms is extremely challenging, as they exhibit an inherently sequential data access order and unbalanced workloads, resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups of up to 27.5 on 40 cores on a shared memory architecture and up to 5,765 using 8,192 cores on a distributed memory architecture. In our experiments, we found that while achieving this scalability, our algorithms produce clustering results of comparable quality to the classical algorithms.
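
    The connected-components view of DBSCAN mentioned above rests on a classic primitive: union-find. Merging every pair of core points within eps of each other yields the clusters as the components. A serial sketch with path halving follows (illustrative only; the paper's contribution is making such merges scale in parallel):

      /* Union-find: the graph primitive behind DBSCAN-as-components. */
      static int find(int *parent, int x) {
          while (parent[x] != x) {
              parent[x] = parent[parent[x]];  /* path halving */
              x = parent[x];
          }
          return x;
      }

      static void unite(int *parent, int a, int b) {
          int ra = find(parent, a), rb = find(parent, b);
          if (ra != rb) parent[rb] = ra;      /* merge two clusters */
      }

      /* Usage: set parent[i] = i for all points; call unite(parent, i, j)
       * for every pair of core points within eps; the cluster label of a
       * point i is then find(parent, i). */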

  20. Networking and AI systems: Requirements and benefits

    NASA Technical Reports Server (NTRS)

    1988-01-01

    The price/performance benefits of network systems are well documented. The ability to share expensive resources drove the adoption of timesharing on mainframes, departmental clusters of minicomputers, and now local area networks of workstations and servers. In the process, other fundamental system requirements emerged. These have now been generalized with open system requirements for hardware, software, applications and tools. The ability to interconnect a variety of vendor products has led to a specification of interfaces that allow new techniques to extend existing systems for new and exciting applications. As an example of a message-passing system, local area networks provide a testbed for many of the issues addressed by future concurrent architectures: synchronization, load balancing, fault tolerance and scalability. Gold Hill has been working with a number of vendors on distributed architectures that range from a network of workstations to a hypercube of microprocessors with distributed memory. Results from early applications are promising both for performance and scalability.

  1. Scalable quantum memory in the ultrastrong coupling regime.

    PubMed

    Kyaw, T H; Felicetti, S; Romero, G; Solano, E; Kwek, L-C

    2015-03-02

    Circuit quantum electrodynamics, consisting of superconducting artificial atoms coupled to on-chip resonators, represents a prime candidate to implement the scalable quantum computing architecture because of the presence of good tunability and controllability. Furthermore, recent advances have pushed the technology towards the ultrastrong coupling regime of light-matter interaction, where the qubit-resonator coupling strength reaches a considerable fraction of the resonator frequency. Here, we propose a qubit-resonator system operating in that regime, as a quantum memory device and study the storage and retrieval of quantum information in and from the Z2 parity-protected quantum memory, within experimentally feasible schemes. We are also convinced that our proposal might pave a way to realize a scalable quantum random-access memory due to its fast storage and readout performances.

  2. Scalable quantum memory in the ultrastrong coupling regime

    PubMed Central

    Kyaw, T. H.; Felicetti, S.; Romero, G.; Solano, E.; Kwek, L.-C.

    2015-01-01

    Circuit quantum electrodynamics, consisting of superconducting artificial atoms coupled to on-chip resonators, represents a prime candidate to implement the scalable quantum computing architecture because of the presence of good tunability and controllability. Furthermore, recent advances have pushed the technology towards the ultrastrong coupling regime of light-matter interaction, where the qubit-resonator coupling strength reaches a considerable fraction of the resonator frequency. Here, we propose a qubit-resonator system operating in that regime, as a quantum memory device and study the storage and retrieval of quantum information in and from the Z2 parity-protected quantum memory, within experimentally feasible schemes. We are also convinced that our proposal might pave a way to realize a scalable quantum random-access memory due to its fast storage and readout performances. PMID:25727251

  3. Scalability Issues for Remote Sensing Infrastructure: A Case Study

    PubMed Central

    Liu, Yang; Picard, Sean; Williamson, Carey

    2017-01-01

    For the past decade, a team of University of Calgary researchers has operated a large “sensor Web” to collect, analyze, and share scientific data from remote measurement instruments across northern Canada. This sensor Web receives real-time data streams from over a thousand Internet-connected sensors, with a particular emphasis on environmental data (e.g., space weather, auroral phenomena, atmospheric imaging). Through research collaborations, we had the opportunity to evaluate the performance and scalability of their remote sensing infrastructure. This article reports the lessons learned from our study, which considered both data collection and data dissemination aspects of their system. On the data collection front, we used benchmarking techniques to identify and fix a performance bottleneck in the system’s memory management for TCP data streams, while also improving system efficiency on multi-core architectures. On the data dissemination front, we used passive and active network traffic measurements to identify and reduce excessive network traffic from the Web robots and JavaScript techniques used for data sharing. While our results are from one specific sensor Web system, the lessons learned may apply to other scientific Web sites with remote sensing infrastructure. PMID:28468262

  4. Tuning collective communication for Partitioned Global Address Space programming models

    DOE PAGES

    Nishtala, Rajesh; Zheng, Yili; Hargrove, Paul H.; ...

    2011-06-12

    Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with the locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language, programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this study we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and the implementation. In particular, PGAS collectives have semantic issues that are different from those in send–receive style message passing programs, and different implementation approaches that take advantage of the one-sided communication style in these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. In conclusion, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and demonstrate that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with InfiniBand interconnect.
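
    The tuned algorithms themselves are not spelled out in the abstract; as a flavor of what a collective auto-tuner chooses among, here is a minimal Python sketch that computes the round-by-round communication schedule of a binomial-tree broadcast, one classic candidate algorithm (the function name and structure are illustrative assumptions, not GASNet's API).

```python
def binomial_broadcast_schedule(n_ranks, root=0):
    """Return per-round (sender, receiver) pairs for a binomial-tree
    broadcast: log2(n) rounds of doubling fan-out, instead of the
    n - 1 sequential sends of a naive root-sends-to-all loop."""
    rounds = []
    have = {0}                 # ranks holding the data, rotated so root is 0
    dist = 1
    while dist < n_ranks:
        pairs = []
        for src in sorted(have):
            dst = src + dist
            if dst < n_ranks:
                # Rotate back into the true rank space.
                pairs.append(((src + root) % n_ranks,
                              (dst + root) % n_ranks))
        have |= {s + dist for s in list(have) if s + dist < n_ranks}
        rounds.append(pairs)
        dist *= 2
    return rounds

# Example: 8 ranks rooted at 3 -> 3 rounds, fan-out doubling each round.
for r, pairs in enumerate(binomial_broadcast_schedule(8, root=3)):
    print("round", r, pairs)
```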

  5. Scalable printed electronics: an organic decoder addressing ferroelectric non-volatile memory.

    PubMed

    Ng, Tse Nga; Schwartz, David E; Lavery, Leah L; Whiting, Gregory L; Russo, Beverly; Krusor, Brent; Veres, Janos; Bröms, Per; Herlogsson, Lars; Alam, Naveed; Hagel, Olle; Nilsson, Jakob; Karlsson, Christer

    2012-01-01

    Scalable circuits of organic logic and memory are realized using all-additive printing processes. A 3-bit organic complementary decoder is fabricated and used to read and write non-volatile, rewritable ferroelectric memory. The decoder-memory array is patterned by inkjet and gravure printing on flexible plastics. Simulation models for the organic transistors are developed, enabling circuit designs tolerant of the variations in printed devices. We explain the key design rules in fabrication of complex printed circuits and elucidate the performance requirements of materials and devices for reliable organic digital logic.

  6. Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study

    DOE PAGES

    Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...

    2015-01-01

    This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
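
    The fix described above is a standard pattern; a minimal Python sketch of pairwise (binary-tree) summation, standing in for the paper's coarray Fortran implementation, shows why the combining depth drops from O(n) to O(log n) when the pair sums run concurrently.

```python
def tree_sum(values):
    """Pairwise (binary-tree) summation: O(log n) combining depth when
    the pairs are summed in parallel, versus O(n) for a sequential
    running sum -- the change that removed the scaling bottleneck."""
    vals = list(values)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # odd element carries over to next round
            nxt.append(vals[-1])
        vals = nxt
    return vals[0] if vals else 0.0

print(tree_sum([0.1] * 10))  # 1.0 (pairwise order often rounds better, too)
```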

  7. Cost aware cache replacement policy in shared last-level cache for hybrid memory based fog computing

    NASA Astrophysics Data System (ADS)

    Jia, Gangyong; Han, Guangjie; Wang, Hao; Wang, Feng

    2018-04-01

    Fog computing requires a large main memory capacity to decrease latency and increase the Quality of Service (QoS). However, dynamic random access memory (DRAM), the commonly used main memory, cannot meet the needs of a fog computing system due to its high power consumption. In recent years, non-volatile memories (NVM) such as Phase-Change Memory (PCM) and Spin-transfer torque RAM (STT-RAM), with their low power consumption, have emerged to replace DRAM. Moreover, the recently proposed hybrid main memory, consisting of both DRAM and NVM, has shown promising advantages in terms of scalability and power consumption. However, the drawbacks of NVM, such as long read/write latency, give rise to asymmetric cache miss costs in the hybrid main memory. Current last level cache (LLC) replacement policies are based on a unified miss cost, and as a result perform poorly in the LLC and add to the cost of using NVM. In order to minimize the cache miss cost in the hybrid main memory, we propose a cost aware cache replacement policy (CACRP) that reduces the number of cache misses served from NVM and improves the cache performance for a hybrid memory system. Experimental results show that our CACRP behaves better in LLC performance, improving performance by up to 43.6% (15.5% on average) compared to LRU.
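
    The paper's calibrated policy is not reproduced here; the following Python sketch only illustrates the core idea of cost-aware victim selection, with hypothetical DRAM/NVM miss-cost weights.

```python
class CostAwareCache:
    """Sketch of cost-aware replacement for a hybrid-memory LLC: the
    victim maximizes staleness / refill-cost, so lines backed by slow
    NVM survive longer than equally stale DRAM-backed lines. The cost
    weights are illustrative assumptions, not the paper's values."""

    COST = {"DRAM": 1.0, "NVM": 4.0}    # hypothetical miss-cost ratio

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}                 # addr -> (last_use, backing)
        self.clock = 0

    def access(self, addr, backing="DRAM"):
        self.clock += 1
        hit = addr in self.lines
        if not hit and len(self.lines) >= self.capacity:
            # Evict the line whose staleness is cheapest to repair.
            victim = max(self.lines, key=lambda a:
                         (self.clock - self.lines[a][0])
                         / self.COST[self.lines[a][1]])
            del self.lines[victim]
        self.lines[addr] = (self.clock, backing)
        return hit

cache = CostAwareCache(2)
cache.access(0x10, "NVM"); cache.access(0x20, "DRAM")
cache.access(0x30, "DRAM")   # evicts the DRAM line, keeps the NVM line
```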

  8. Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak

    1999-01-01

    The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.

  9. Scalable printed electronics: an organic decoder addressing ferroelectric non-volatile memory

    PubMed Central

    Ng, Tse Nga; Schwartz, David E.; Lavery, Leah L.; Whiting, Gregory L.; Russo, Beverly; Krusor, Brent; Veres, Janos; Bröms, Per; Herlogsson, Lars; Alam, Naveed; Hagel, Olle; Nilsson, Jakob; Karlsson, Christer

    2012-01-01

    Scalable circuits of organic logic and memory are realized using all-additive printing processes. A 3-bit organic complementary decoder is fabricated and used to read and write non-volatile, rewritable ferroelectric memory. The decoder-memory array is patterned by inkjet and gravure printing on flexible plastics. Simulation models for the organic transistors are developed, enabling circuit designs tolerant of the variations in printed devices. We explain the key design rules in fabrication of complex printed circuits and elucidate the performance requirements of materials and devices for reliable organic digital logic. PMID:22900143

  10. OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets.

    PubMed

    García-Pedrajas, Nicolás; Perez-Rodríguez, Javier; de Haro-García, Aida

    2013-02-01

    In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
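
    As a rough illustration of the divide-and-conquer step, the Python sketch below splits an imbalanced binary-class data set into class-balanced chunks that independent instance selectors can process in parallel; the chunk construction details are assumptions for illustration, not the authors' exact procedure.

```python
import random
from collections import defaultdict

def balanced_chunks(y, chunk_size=50):
    """Split a two-class imbalanced data set (labels y) into chunks of
    indices that pair each slice of the majority class with an
    equal-sized random sample of the minority class, so instance
    selection runs on balanced subsets, independently and in linear
    time overall."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    minority, majority = sorted(by_class, key=lambda c: len(by_class[c]))
    maj_idx = by_class[majority][:]
    random.shuffle(maj_idx)
    min_idx = by_class[minority]
    chunks = []
    for start in range(0, len(maj_idx), chunk_size):
        maj_part = maj_idx[start:start + chunk_size]
        min_part = random.sample(min_idx, min(len(maj_part), len(min_idx)))
        chunks.append(min_part + maj_part)
    return chunks   # each chunk feeds an independent instance selector

y = [0] * 950 + [1] * 50                     # 19:1 class imbalance
print([len(c) for c in balanced_chunks(y)])  # 19 balanced chunks of ~100
```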

  11. Performance and scalability evaluation of "Big Memory" on Blue Gene Linux.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yoshii, K.; Iskra, K.; Naik, H.

    2011-05-01

    We address memory performance issues observed in Blue Gene Linux and discuss the design and implementation of 'Big Memory' - an alternative, transparent memory space introduced to eliminate the memory performance issues. We evaluate the performance of Big Memory using custom memory benchmarks, NAS Parallel Benchmarks, and the Parallel Ocean Program, at a scale of up to 4,096 nodes. We find that Big Memory successfully resolves the performance issues normally encountered in Blue Gene Linux. For the ocean simulation program, we even find that Linux with Big Memory provides better scalability than does the lightweight compute node kernel designed solely for high-performance applications. Originally intended exclusively for compute node tasks, our new memory subsystem dramatically improves the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray radio telescope as an example.

  12. Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

    DOE PAGES

    Wang, Bei; Ethier, Stephane; Tang, William; ...

    2017-06-29

    The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability of the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.

  13. Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wang, Bei; Ethier, Stephane; Tang, William

    The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability of the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.

  14. The Simulation of Real-time Scalable Coherent Interface

    NASA Technical Reports Server (NTRS)

    Li, Qiang; Grant, Terry; Grover, Radhika S.

    1997-01-01

    Scalable Coherent Interface (SCI, IEEE/ANSI Std 1596-1992) (SCI1, SCI2) is a high performance interconnect for shared memory multiprocessor systems. In this project we investigate an SCI real-time protocol (RTSCI1) that uses directed flow-control symbols. We studied the issues of efficient generation of control symbols and created a simulation model of the protocol on a ring-based SCI system. This report presents the results of the study. The project has been implemented using SES/Workbench. The details that follow encompass aspects of both SCI and flow control protocols, as well as the effect of realistic client/server processing delay. The report is organized as follows. Section 2 provides a description of the simulation model. Section 3 describes the protocol implementation details. The next three sections of the report elaborate on the workload, results and conclusions. Appended to the report is a description of the tool, SES/Workbench, used in our simulation, and internal details of our implementation of the protocol.

  15. Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daily, Jeffrey A.; Lewis, Robert R.

    2011-11-30

    Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.
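
    GAiN's actual API is not reproduced here; the toy Python class below is only meant to convey the drop-in idea, i.e., preserving a NumPy-like interface while operations run block-by-block over data that, in Global Arrays, would be distributed across nodes.

```python
import numpy as np

class DistArray:
    """Toy model of the GAiN idea: a NumPy-like interface over data
    held in per-owner blocks. Here the blocks are plain row blocks in
    one process; in Global Arrays each block would live on a
    different node of the cluster."""

    def __init__(self, shape, n_owners=4):
        rows = np.array_split(np.arange(shape[0]), n_owners)
        self.blocks = [(r, np.zeros((len(r),) + shape[1:])) for r in rows]

    def fill(self, value):
        for _, blk in self.blocks:      # each owner updates locally
            blk[:] = value

    def sum(self):
        # Local partial sums plus one reduction: the pattern a
        # distributed drop-in uses instead of touching remote data.
        return sum(blk.sum() for _, blk in self.blocks)

a = DistArray((1000, 8))
a.fill(0.5)
print(a.sum())   # 4000.0, computed block-by-block
```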

  16. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  17. Multiband Radio Frequency Interconnect (MRFI) Technology For Next Generation Mobile/Airborne Computing Systems

    DTIC Science & Technology

    2017-02-01

    Multiband radio frequency interconnect (MRFI) technology to enable high scalability and reconfigurability for inter-CPU/Memory communications with an increased number of communication channels in frequency ...

  18. MaMR: High-performance MapReduce programming model for material cloud applications

    NASA Astrophysics Data System (ADS)

    Jing, Weipeng; Tong, Danyu; Wang, Yangang; Wang, Jingyuan; Liu, Yaqiu; Zhao, Peng

    2017-02-01

    With the increasing data size in materials science, existing programming models no longer satisfy the application requirements. MapReduce is a programming model that enables the easy development of scalable parallel applications to process big data on cloud computing systems. However, this model does not directly support the processing of multiple related data sets, and its processing performance does not reflect the advantages of cloud computing. To enhance the capability of workflow applications in material data processing, we defined a programming model for material cloud applications, called MaMR, that supports multiple different Map and Reduce functions running concurrently on top of a hybrid shared-memory BSP model. An optimized data sharing strategy to supply the shared data to the different Map and Reduce stages was also designed. We added a new merge phase to MapReduce that can efficiently merge data from the map and reduce modules. Experiments showed that the model and framework deliver effective performance improvements compared to previous work.
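
    A minimal Python sketch of the merge-phase idea follows: two MapReduce jobs run over the same data, and a merge step combines their reduced outputs by key. The function names and the merge rule are illustrative assumptions, not the MaMR API.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Plain in-memory MapReduce: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, val in map_fn(rec):
            groups[key].append(val)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

def merge_phase(*reduced_outputs, merge_fn=lambda k, vs: sum(vs)):
    """The extra phase sketched after MaMR: combine the reduce outputs
    of several concurrently run jobs by key (merge_fn is a stand-in)."""
    merged = defaultdict(list)
    for out in reduced_outputs:
        for k, v in out.items():
            merged[k].append(v)
    return {k: merge_fn(k, vs) for k, vs in merged.items()}

# Two related jobs over shared data, then one merge over their results.
data = ["a b", "b c"]
wc = run_mapreduce(data, lambda r: [(w, 1) for w in r.split()],
                   lambda k, vs: sum(vs))          # word counts
lc = run_mapreduce(data, lambda r: [(w, len(w)) for w in r.split()],
                   lambda k, vs: sum(vs))          # character counts
print(merge_phase(wc, lc))
```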

  19. Authentication and Key Establishment in Dynamic Wireless Sensor Networks

    PubMed Central

    Qiu, Ying; Zhou, Jianying; Baek, Joonsang; Lopez, Javier

    2010-01-01

    When a sensor node roams within a very large and distributed wireless sensor network, which consists of numerous sensor nodes, its routing path and neighborhood keep changing. In order to provide a high level of security in this environment, the moving sensor node needs to be authenticated by its new neighboring nodes, and a key must be established for secure communication. This paper proposes an efficient and scalable protocol to establish and update the authentication key in a dynamic wireless sensor network environment. The protocol guarantees that two sensor nodes share at least one key with probability 1 (100%) at lower memory and energy cost, while not incurring considerable communication overhead. PMID:22319321

  20. Exploring Manycore Multinode Systems for Irregular Applications with FPGA Prototyping

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ceriani, Marco; Palermo, Gianluca; Secchi, Simone

    We present a prototype of a multi-core architecture implemented on FPGA, designed to enable efficient execution of irregular applications on distributed shared memory machines, while maintaining high performance on regular workloads. The architecture is composed of off-the-shelf soft cores, local interconnection and memory interface, integrated with custom components that optimize it for irregular applications. It relies on three key elements: a global address space, multithreading, and fine-grained synchronization. Global addresses are scrambled to reduce the formation of network hot-spots, while the latency of transactions is covered by integrating a hardware scheduler within the custom load/store buffers to take advantage of the availability of multiple execution threads, increasing efficiency transparently to the application. We evaluated a dual-node system on irregular kernels, showing scalability in the number of cores and threads.
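
    As an aside on the address-scrambling element, a common realization is to hash block indices before selecting a home node; the Python sketch below shows one such mixer (the constant and block size are illustrative, not the prototype's actual function).

```python
def scramble(addr, n_nodes, block=64):
    """Map a global address to (node, offset) by hashing its block
    index, so consecutive blocks scatter across nodes and a hot
    region does not become a single network hot-spot. The multiplier
    is the 64-bit Fibonacci-hashing constant; any good mixer works."""
    blk = addr // block
    mixed = (blk * 0x9E3779B97F4A7C15) & (2**64 - 1)
    return mixed % n_nodes, addr % block

# Sequential blocks land on varying nodes instead of a fixed stride.
print([scramble(a, 4)[0] for a in range(0, 512, 64)])
```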

  1. Thermally efficient and highly scalable In2Se3 nanowire phase change memory

    NASA Astrophysics Data System (ADS)

    Jin, Bo; Kang, Daegun; Kim, Jungsik; Meyyappan, M.; Lee, Jeong-Soo

    2013-04-01

    The electrical characteristics of nonvolatile In2Se3 nanowire phase change memory are reported. Size-dependent memory switching behavior was observed in nanowires of varying diameters, with set/reset threshold voltages as low as 3.45 V/6.25 V for a 60 nm nanowire, which is promising for highly scalable nanowire memory applications. The size-dependent thermal resistance of In2Se3 nanowire memory cells was also estimated, with values as high as 5.86×10^13 and 1.04×10^6 K/W for a 60 nm nanowire memory cell in the amorphous and crystalline phases, respectively. Such high thermal resistances are beneficial for improving thermal efficiency and thus reducing programming power consumption based on Fourier's law. The evaluation of thermal resistance provides an avenue to develop thermally efficient memory cell architectures.

  2. Analysis of scalability of high-performance 3D image processing platform for virtual colonoscopy

    NASA Astrophysics Data System (ADS)

    Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli

    2014-03-01

    One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time, which is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that need to be processed. For this purpose, we previously developed a software platform for high-performance 3D medical image processing, called HPC 3D-MIP platform, which employs increasingly available and affordable commodity computing systems such as the multicore, cluster, and cloud computing systems. To achieve scalable high-performance computing, the platform employed size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D-MIP algorithms, supported task scheduling for efficient load distribution and balancing, and consisted of layered parallel software libraries that allow image processing applications to share the common functionalities. We evaluated the performance of the HPC 3D-MIP platform by applying it to computationally intensive processes in virtual colonoscopy. Experimental results showed a 12-fold performance improvement on a workstation with 12-core CPUs over the original sequential implementation of the processes, indicating the efficiency of the platform. Analysis of performance scalability based on Amdahl's law for symmetric multicore chips showed the potential for high performance scalability of the HPC 3D-MIP platform when a larger number of cores is available.

  3. Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units.

    PubMed

    Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley

    2011-05-01

    Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.

  4. Resistive switching characteristics of polymer non-volatile memory devices in a scalable via-hole structure.

    PubMed

    Kim, Tae-Wook; Choi, Hyejung; Oh, Seung-Hwan; Jo, Minseok; Wang, Gunuk; Cho, Byungjin; Kim, Dong-Yu; Hwang, Hyunsang; Lee, Takhee

    2009-01-14

    The resistive switching characteristics of a polyfluorene-derivative polymer material in a sub-micron scale via-hole device structure were investigated. The scalable via-hole sub-microstructure was fabricated using an e-beam lithographic technique. The polymer non-volatile memory devices varied in size from 40 × 40 μm^2 to 200 × 200 nm^2. From the scaling of junction size, the memory mechanism can be attributed to space-charge-limited current with filamentary conduction. Sub-micron scale polymer memory devices showed excellent resistive switching behaviours such as a large ON/OFF ratio (I_ON/I_OFF ≈ 10^4), excellent device-to-device switching uniformity, good sweep endurance, and good retention times (more than 10,000 s). The successful operation of sub-micron scale memory devices of our polyfluorene-derivative polymer shows promise for fabricating high-density polymer memory devices.

  5. Scalable Quantum Networks for Distributed Computing and Sensing

    DTIC Science & Technology

    2016-04-01

    AFRL-AFOSR-UK-TR-2016-0007 (Final Report): ... probabilistic measurement, so we developed quantum memories and guided-wave implementations of same, demonstrating controlled delay of a heralded single ... Second, fundamental scalability requires a method to synchronize protocols based on quantum measurements, which are inherently probabilistic. To meet ...

  6. Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) Technology Infrastructure for a Distributed Data Network

    PubMed Central

    Schilling, Lisa M.; Kwan, Bethany M.; Drolshagen, Charles T.; Hosokawa, Patrick W.; Brandt, Elias; Pace, Wilson D.; Uhrich, Christopher; Kamerick, Michael; Bunting, Aidan; Payne, Philip R.O.; Stephens, William E.; George, Joseph M.; Vance, Mark; Giacomini, Kelli; Braddy, Jason; Green, Mika K.; Kahn, Michael G.

    2013-01-01

    Introduction: Distributed Data Networks (DDNs) offer infrastructure solutions for sharing electronic health data from across disparate data sources to support comparative effectiveness research. Data sharing mechanisms must address technical and governance concerns stemming from network security and data disclosure laws and best practices, such as HIPAA. Methods: The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) deploys TRIAD grid technology, a common data model, detailed technical documentation, and custom software for data harmonization to facilitate data sharing in collaboration with stakeholders in the care of safety net populations. Data sharing partners host TRIAD grid nodes containing harmonized clinical data within their internal or hosted network environments. Authorized users can use a central web-based query system to request analytic data sets. Discussion: SAFTINet DDN infrastructure achieved a number of data sharing objectives, including scalable and sustainable systems for ensuring harmonized data structures and terminologies and secure distributed queries. Initial implementation challenges were resolved through iterative discussions, development and implementation of technical documentation, governance, and technology solutions. PMID:25848567

  7. Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) Technology Infrastructure for a Distributed Data Network.

    PubMed

    Schilling, Lisa M; Kwan, Bethany M; Drolshagen, Charles T; Hosokawa, Patrick W; Brandt, Elias; Pace, Wilson D; Uhrich, Christopher; Kamerick, Michael; Bunting, Aidan; Payne, Philip R O; Stephens, William E; George, Joseph M; Vance, Mark; Giacomini, Kelli; Braddy, Jason; Green, Mika K; Kahn, Michael G

    2013-01-01

    Distributed Data Networks (DDNs) offer infrastructure solutions for sharing electronic health data from across disparate data sources to support comparative effectiveness research. Data sharing mechanisms must address technical and governance concerns stemming from network security and data disclosure laws and best practices, such as HIPAA. The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) deploys TRIAD grid technology, a common data model, detailed technical documentation, and custom software for data harmonization to facilitate data sharing in collaboration with stakeholders in the care of safety net populations. Data sharing partners host TRIAD grid nodes containing harmonized clinical data within their internal or hosted network environments. Authorized users can use a central web-based query system to request analytic data sets. SAFTINet DDN infrastructure achieved a number of data sharing objectives, including scalable and sustainable systems for ensuring harmonized data structures and terminologies and secure distributed queries. Initial implementation challenges were resolved through iterative discussions, development and implementation of technical documentation, governance, and technology solutions.

  8. Robust resistive memory devices using solution-processable metal-coordinated azo aromatics

    NASA Astrophysics Data System (ADS)

    Goswami, Sreetosh; Matula, Adam J.; Rath, Santi P.; Hedström, Svante; Saha, Surajit; Annamalai, Meenakshi; Sengupta, Debabrata; Patra, Abhijeet; Ghosh, Siddhartha; Jani, Hariom; Sarkar, Soumya; Motapothula, Mallikarjuna Rao; Nijhuis, Christian A.; Martin, Jens; Goswami, Sreebrata; Batista, Victor S.; Venkatesan, T.

    2017-12-01

    Non-volatile memories will play a decisive role in the next generation of digital technology. Flash memories are currently the key player in the field, yet they fail to meet the commercial demands of scalability and endurance. Resistive memory devices, and in particular memories based on low-cost, solution-processable and chemically tunable organic materials, are promising alternatives explored by the industry. However, to date, they have been lacking the performance and mechanistic understanding required for commercial translation. Here we report a resistive memory device based on a spin-coated active layer of a transition-metal complex, which shows high reproducibility (~350 devices), fast switching (≤30 ns), excellent endurance (~10^12 cycles), stability (>10^6 s) and scalability (down to ~60 nm^2). In situ Raman and ultraviolet-visible spectroscopy alongside spectroelectrochemistry and quantum chemical calculations demonstrate that the redox state of the ligands determines the switching states of the device whereas the counterions control the hysteresis. This insight may accelerate the technological deployment of organic resistive memories.

  9. Preliminary basic performance analysis of the Cedar multiprocessor memory system

    NASA Technical Reports Server (NTRS)

    Gallivan, K.; Jalby, W.; Turner, S.; Veidenbaum, A.; Wijshoff, H.

    1991-01-01

    Some preliminary basic results on the performance of the Cedar multiprocessor memory system are presented. Empirical results are presented and used to calibrate a memory system simulator which is then used to discuss the scalability of the system.

  10. Message Passing and Shared Address Space Parallelism on an SMP Cluster

    NASA Technical Reports Server (NTRS)

    Shan, Hongzhang; Singh, Jaswinder P.; Oliker, Leonid; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2002-01-01

    Currently, message passing (MP) and shared address space (SAS) are the two leading parallel programming paradigms. MP has been standardized with MPI, and is the more common and mature approach; however, code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of and the programming effort required for six applications under both programming models on a 32-processor PC-SMP cluster, a platform that is becoming increasingly attractive for high-end scientific computing. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and/or complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications, while being competitive for the others. A hybrid MPI+SAS strategy shows only a small performance advantage over pure MPI in some cases. Finally, improved implementations of two MPI collective operations on PC-SMP clusters are presented.

  11. Pseudo-orthogonalization of memory patterns for associative memory.

    PubMed

    Oku, Makito; Makino, Takaki; Aihara, Kazuyuki

    2013-11-01

    A new method for improving the storage capacity of associative memory models on a neural network is proposed. The storage capacity of the network increases in proportion to the network size in the case of random patterns, but, in general, the capacity suffers from correlation among memory patterns. Numerous solutions to this problem have been proposed so far, but their high computational cost limits their scalability. In this paper, we propose a novel and simple solution that is locally computable without any iteration. Our method involves XNOR masking of the original memory patterns with random patterns, and the masked patterns and masks are concatenated. The resulting decorrelated patterns allow higher storage capacity at the cost of the pattern length. Furthermore, the increase in the pattern length can be reduced through blockwise masking, which results in a small amount of capacity loss. Movie replay and image recognition are presented as examples to demonstrate the scalability of the proposed method.
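
    The masking step is simple enough to show directly; in the ±1 coding used below, XNOR is elementwise multiplication. A small NumPy sketch, following the description in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_orthogonalize(patterns):
    """XNOR-mask correlated +/-1 memory patterns with random masks and
    concatenate the masks, as the paper describes: the masked halves
    become nearly uncorrelated, at the cost of doubled pattern length.
    (In the +/-1 coding, XNOR is the elementwise product.)"""
    masks = rng.choice([-1, 1], size=patterns.shape)
    masked = patterns * masks
    return np.concatenate([masked, masks], axis=1)

# Two highly correlated patterns...
p = np.ones((2, 1000))
p[1, :50] = -1
print("overlap before:", p[0] @ p[1] / p.shape[1])        # ~0.9
# ...become nearly orthogonal after masking.
q = pseudo_orthogonalize(p)
print("overlap after: ", q[0] @ q[1] / q.shape[1])        # ~0.0
```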

  12. GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Simmhan, Yogesh; Kumbhare, Alok; Wickramaarachchi, Charith

    2014-08-25

    Large scale graph processing is a major research area for Big Data exploration. Vertex centric programming models like Pregel are gaining traction due to their simple abstraction that allows for scalable execution on distributed systems naturally. However, there are limitations to this approach which cause vertex centric algorithms to under-perform due to poor compute to communication overhead ratio and slow convergence of iterative supersteps. In this paper we introduce GoFFish, a scalable sub-graph centric framework co-designed with a distributed persistent graph storage for large scale graph analytics on commodity clusters. We introduce a sub-graph centric programming abstraction that combines the scalability of a vertex centric approach with the flexibility of shared memory sub-graph computation. We map Connected Components, SSSP and PageRank algorithms to this model to illustrate its flexibility. Further, we empirically analyze GoFFish using several real world graphs and demonstrate its significant performance improvement, orders of magnitude in some cases, compared to Apache Giraph, the leading open source vertex centric implementation.

  13. Scalable Adaptive Graphics Environment (SAGE) Software for the Visualization of Large Data Sets on a Video Wall

    NASA Technical Reports Server (NTRS)

    Jedlovec, Gary; Srikishen, Jayanthi; Edwards, Rita; Cross, David; Welch, Jon; Smith, Matt

    2013-01-01

    The use of collaborative scientific visualization systems for the analysis, visualization, and sharing of "big data" available from new high resolution remote sensing satellite sensors or four-dimensional numerical model simulations is propelling the wider adoption of ultra-resolution tiled display walls interconnected by high speed networks. These systems require a globally connected and well-integrated operating environment that provides persistent visualization and collaboration services. This abstract and subsequent presentation describe a new collaborative visualization system installed for NASA's Short-term Prediction Research and Transition (SPoRT) program at Marshall Space Flight Center and its use for Earth science applications. The system consists of a 3 x 4 array of 1920 x 1080 pixel thin bezel video monitors mounted on a wall in a scientific collaboration lab. The monitors are physically and virtually integrated into a 14' x 7' video display. The display of scientific data on the video wall is controlled by a single Alienware Aurora PC with a 2nd Generation Intel Core 4.1 GHz processor, 32 GB memory, and an AMD Fire Pro W600 video card with 6 mini display port connections. Six mini display-to-dual DVI cables are used to connect the 12 individual video monitors. The open source Scalable Adaptive Graphics Environment (SAGE) windowing and media control framework, running on top of the Ubuntu 12 Linux operating system, allows several users to simultaneously control the display and storage of high resolution still and moving graphics in a variety of formats, on tiled display walls of any size. The Ubuntu operating system supports the open source Scalable Adaptive Graphics Environment (SAGE) software which provides a common environment, or framework, enabling its users to access, display and share a variety of data-intensive information. This information can be digital-cinema animations, high-resolution images, high-definition video-teleconferences, presentation slides, documents, spreadsheets or laptop screens. SAGE is cross-platform, community-driven, open-source visualization and collaboration middleware that utilizes shared national and international cyberinfrastructure for the advancement of scientific research and education.

  14. Scalable Adaptive Graphics Environment (SAGE) Software for the Visualization of Large Data Sets on a Video Wall

    NASA Astrophysics Data System (ADS)

    Jedlovec, G.; Srikishen, J.; Edwards, R.; Cross, D.; Welch, J. D.; Smith, M. R.

    2013-12-01

    The use of collaborative scientific visualization systems for the analysis, visualization, and sharing of 'big data' available from new high resolution remote sensing satellite sensors or four-dimensional numerical model simulations is propelling the wider adoption of ultra-resolution tiled display walls interconnected by high speed networks. These systems require a globally connected and well-integrated operating environment that provides persistent visualization and collaboration services. This abstract and subsequent presentation describe a new collaborative visualization system installed for NASA's Short-term Prediction Research and Transition (SPoRT) program at Marshall Space Flight Center and its use for Earth science applications. The system consists of a 3 x 4 array of 1920 x 1080 pixel thin bezel video monitors mounted on a wall in a scientific collaboration lab. The monitors are physically and virtually integrated into a 14' x 7' video display. The display of scientific data on the video wall is controlled by a single Alienware Aurora PC with a 2nd Generation Intel Core 4.1 GHz processor, 32 GB memory, and an AMD Fire Pro W600 video card with 6 mini display port connections. Six mini display-to-dual DVI cables are used to connect the 12 individual video monitors. The open source Scalable Adaptive Graphics Environment (SAGE) windowing and media control framework, running on top of the Ubuntu 12 Linux operating system, allows several users to simultaneously control the display and storage of high resolution still and moving graphics in a variety of formats, on tiled display walls of any size. The Ubuntu operating system supports the open source Scalable Adaptive Graphics Environment (SAGE) software which provides a common environment, or framework, enabling its users to access, display and share a variety of data-intensive information. This information can be digital-cinema animations, high-resolution images, high-definition video-teleconferences, presentation slides, documents, spreadsheets or laptop screens. SAGE is cross-platform, community-driven, open-source visualization and collaboration middleware that utilizes shared national and international cyberinfrastructure for the advancement of scientific research and education.

  15. Fractional Steps methods for transient problems on commodity computer architectures

    NASA Astrophysics Data System (ADS)

    Krotkiewski, M.; Dabrowski, M.; Podladchikov, Y. Y.

    2008-12-01

    Fractional Steps methods are suitable for modeling transient processes that are central to many geological applications. Low memory requirements and modest computational complexity facilitate calculations on high-resolution three-dimensional models. An efficient implementation of Alternating Direction Implicit/Locally One-Dimensional schemes for an Opteron-based shared memory system is presented. The memory bandwidth usage, the main bottleneck on modern computer architectures, is specially addressed. High efficiency of above 2 GFlops per CPU is sustained for problems of 1 billion degrees of freedom. The optimized sequential implementation of all 1D sweeps is comparable in execution time to copying the used data in memory. Scalability of the parallel implementation on up to 8 CPUs is close to perfect. Performing one timestep of the Locally One-Dimensional scheme on a system of 1000^3 unknowns on 8 CPUs takes only 11 s. We validate the LOD scheme using a computational model of an isolated inclusion subject to a constant far-field flux. Next, we study numerically the evolution of a diffusion front and the effective thermal conductivity of composites consisting of multiple inclusions and compare the results with predictions based on the differential effective medium approach. Finally, application of the developed parabolic solver is suggested for a real-world problem of fluid transport and reactions inside a reservoir.
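
    To make the scheme concrete, here is a compact NumPy sketch of one LOD timestep: an implicit 1D diffusion solve (Thomas algorithm) swept along each axis in turn, with the independent lines being the natural unit of parallel work. Boundary handling is simplified and the grid is assumed cubic; this is an illustration, not the paper's optimized kernel.

```python
import numpy as np

def thomas(sub, diag, sup, rhs):
    """Solve one tridiagonal system by the Thomas algorithm."""
    n = len(rhs)
    d = diag.astype(float).copy()
    r = rhs.astype(float).copy()
    for i in range(1, n):                  # forward elimination
        w = sub[i] / d[i - 1]
        d[i] -= w * sup[i - 1]
        r[i] -= w * r[i - 1]
    x = np.empty(n)
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):         # back substitution
        x[i] = (r[i] - sup[i] * x[i + 1]) / d[i]
    return x

def lod_step(u, alpha):
    """One LOD timestep for 3D diffusion: a backward-Euler 1D solve
    along each axis in turn (homogeneous Dirichlet boundaries).
    Every sweep is a batch of independent tridiagonal lines."""
    n = u.shape[0]
    sub = np.full(n, -alpha)
    sup = np.full(n, -alpha)
    diag = np.full(n, 1.0 + 2.0 * alpha)
    for axis in range(u.ndim):
        um = np.moveaxis(u, axis, 0)
        lines = um.reshape(n, -1).copy()
        for k in range(lines.shape[1]):    # independent 1D solves
            lines[:, k] = thomas(sub, diag, sup, lines[:, k])
        u = np.moveaxis(lines.reshape(um.shape), 0, axis)
    return u

u = np.zeros((32, 32, 32))
u[16, 16, 16] = 1.0
u = lod_step(u, alpha=0.25)   # heat spreads from the point source
print(u.sum())                # mass approximately conserved (~1.0)
```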

  16. System-Level Integration of Mass Memory

    NASA Technical Reports Server (NTRS)

    Cox, Brian; Mellstrom, Jeffrey; Wysocky, Terry

    2008-01-01

    A report discusses integrating multiple memory modules on the high-speed serial interconnect (IEEE 1393) that is used by a spacecraft's inter-module communications in order to ease data congestion and provide for a scalable, strong, flexible system that can meet new system-level mass memory requirements.

  17. Support of Multidimensional Parallelism in the OpenMP Programming Model

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele

    2003-01-01

    OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often affect the scalability of applications. Examples of these limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test with benchmark codes and a cloud modeling code.

  18. Scalable Conjunction Processing using Spatiotemporally Indexed Ephemeris Data

    NASA Astrophysics Data System (ADS)

    Budianto-Ho, I.; Johnson, S.; Sivilli, R.; Alberty, C.; Scarberry, R.

    2014-09-01

    The collision warnings produced by the Joint Space Operations Center (JSpOC) are of critical importance in protecting U.S. and allied spacecraft against destructive collisions and protecting the lives of astronauts during space flight. As the Space Surveillance Network (SSN) improves its sensor capabilities for tracking small and dim space objects, the number of tracked objects increases from thousands to hundreds of thousands of objects, while the number of potential conjunctions increases with the square of the number of tracked objects. Classical filtering techniques such as apogee and perigee filters have proven insufficient. Novel and orders of magnitude faster conjunction analysis algorithms are required to find conjunctions in a timely manner. Stellar Science has developed innovative filtering techniques for satellite conjunction processing using spatiotemporally indexed ephemeris data that efficiently and accurately reduces the number of objects requiring high-fidelity and computationally-intensive conjunction analysis. Two such algorithms, one based on the k-d Tree pioneered in robotics applications and the other based on Spatial Hash Tables used in computer gaming and animation, use, at worst, an initial O(N log N) preprocessing pass (where N is the number of tracked objects) to build large O(N) spatial data structures that substantially reduce the required number of O(N^2) computations, substituting linear memory usage for quadratic processing time. The filters have been implemented as Open Services Gateway initiative (OSGi) plug-ins for the Continuous Anomalous Orbital Situation Discriminator (CAOS-D) conjunction analysis architecture. We have demonstrated the effectiveness, efficiency, and scalability of the techniques using a catalog of 100,000 objects, an analysis window of one day, on a 64-core computer with 1TB shared memory. Each algorithm can process the full catalog in 6 minutes or less, almost a twenty-fold performance improvement over the baseline implementation running on the same machine. We will present an overview of the algorithms and results that demonstrate the scalability of our concepts.
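
    The Spatial Hash Table filter lends itself to a short sketch. The Python below buckets objects by coarse grid cell and emits only same-or-adjacent-cell pairs as candidates for high-fidelity analysis; the cell size and data layout are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product

def conjunction_candidates(positions, cell=10.0):
    """Spatial-hash coarse filter: bucket objects by grid cell and only
    pair objects in the same or adjacent cells, replacing the all-pairs
    O(N^2) sweep with near-linear work for realistic densities. `cell`
    should be at least the conjunction screening distance."""
    grid = defaultdict(list)
    for oid, (x, y, z) in positions.items():
        grid[(int(x // cell), int(y // cell), int(z // cell))].append(oid)

    pairs = set()
    for (cx, cy, cz), members in grid.items():
        # Gather this cell plus its 26 neighbors.
        nearby = []
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            nearby += grid.get((cx + dx, cy + dy, cz + dz), [])
        for a in members:
            for b in nearby:
                if a < b:
                    pairs.add((a, b))
    return pairs    # candidates for high-fidelity conjunction analysis

sats = {1: (0, 0, 0), 2: (3, 4, 0), 3: (500, 500, 500)}
print(conjunction_candidates(sats))   # {(1, 2)}; object 3 is far away
```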

  19. Multipulse addressing of a Raman quantum memory: configurable beam splitting and efficient readout.

    PubMed

    Reim, K F; Nunn, J; Jin, X-M; Michelberger, P S; Champion, T F M; England, D G; Lee, K C; Kolthammer, W S; Langford, N K; Walmsley, I A

    2012-06-29

    Quantum memories are vital to the scalability of photonic quantum information processing (PQIP), since the storage of photons enables repeat-until-success strategies. On the other hand, the key element of all PQIP architectures is the beam splitter, which allows us to coherently couple optical modes. Here, we show how to combine these crucial functionalities by addressing a Raman quantum memory with multiple control pulses. The result is a coherent optical storage device with an extremely large time bandwidth product, that functions as an array of dynamically configurable beam splitters, and that can be read out with arbitrarily high efficiency. Networks of such devices would allow fully scalable PQIP, with applications in quantum computation, long distance quantum communications and quantum metrology.

  20. Slices: A Scalable Partitioner for Finite Element Meshes

    NASA Technical Reports Server (NTRS)

    Ding, H. Q.; Ferraro, R. D.

    1995-01-01

    A parallel partitioner for partitioning unstructured finite element meshes on distributed memory architectures is developed. The element based partitioner can handle mixtures of different element types. All algorithms adopted in the partitioner are scalable, including a communication template for unpredictable incoming messages, as shown in actual timing measurements.

  1. A scalable parallel black oil simulator on distributed memory parallel computers

    NASA Astrophysics Data System (ADS)

    Wang, Kun; Liu, Hui; Chen, Zhangxin

    2015-11-01

    This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
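
    Of the techniques listed, the inexact Newton method is easy to illustrate: the Newton system is solved only to a loose inner tolerance, saving linear-solver work far from the solution. The Python sketch below uses conjugate gradients on the normal equations as a stand-in inner solver; it is a generic textbook scheme, not the simulator's decoupled, multigrid-preconditioned method.

```python
import numpy as np

def cg(A, b, rel_tol, max_it=200):
    """Conjugate gradients, stopped at a *loose* relative residual."""
    x = np.zeros_like(b)
    r = b - A @ x
    p, rs = r.copy(), r @ r
    b_norm = np.linalg.norm(b) or 1.0
    for _ in range(max_it):
        if np.linalg.norm(r) <= rel_tol * b_norm:
            break
        Ap = A @ p
        a = rs / (p @ Ap)
        x += a * p
        r -= a * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def inexact_newton(F, J, x, eta=0.1, tol=1e-10, max_it=30):
    """Each Newton step solves J dx = -F only to inner tolerance eta
    (here via CG on the normal equations), as in inexact Newton."""
    for _ in range(max_it):
        f = F(x)
        if np.linalg.norm(f) < tol:
            break
        Jx = J(x)
        dx = cg(Jx.T @ Jx, -Jx.T @ f, rel_tol=eta)
        x = x + dx
    return x

# Tiny demo: solve x^2 + y^2 = 1, x - y = 0.
F = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2 * v[0], 2 * v[1]], [1.0, -1.0]])
print(inexact_newton(F, J, np.array([2.0, 0.5])))  # ~[0.7071, 0.7071]
```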

  2. Scalability improvements to NRLMOL for DFT calculations of large molecules

    NASA Astrophysics Data System (ADS)

    Diaz, Carlos Manuel

    Advances in high performance computing (HPC) have provided a way to treat large, computationally demanding tasks using thousands of processors. With the development of more powerful HPC architectures, the need to create efficient and scalable code has grown more important. Electronic structure calculations are valuable in understanding experimental observations and are routinely used for new materials predictions. For electronic structure calculations, the memory and computation time grow with the number of atoms; in particular, memory requirements scale as N^2, where N is the number of atoms. While the recent advances in HPC offer platforms with large numbers of cores, the limited amount of memory available on a given node and poor scalability of the electronic structure code hinder the efficient usage of these platforms. This thesis will present some developments to overcome these bottlenecks in order to study large systems. These developments, which are implemented in the NRLMOL electronic structure code, involve the use of sparse matrix storage formats and linear algebra using sparse and distributed matrices. These developments, along with other related work, now allow ground state density functional calculations using up to 25,000 basis functions and excited state calculations using up to 17,000 basis functions while utilizing all cores on a node. An example on a light-harvesting triad molecule is described. Finally, future plans to further improve the scalability will be presented.

  3. The Efficiency and the Scalability of an Explicit Operator on an IBM POWER4 System

    NASA Technical Reports Server (NTRS)

    Frumkin, Michael; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    We present an evaluation of the efficiency and the scalability of an explicit CFD operator on an IBM POWER4 system. The POWER4 architecture exhibits a common trend in HPC architectures: boosting CPU processing power by increasing the number of functional units, while hiding the latency of memory access by increasing the depth of the memory hierarchy. The overall machine performance depends on the ability of the caches-buses-fabric-memory to feed the functional units with the data to be processed. In this study we evaluate the efficiency and scalability of one explicit CFD operator on an IBM POWER4. This operator performs computations at the points of a Cartesian grid and involves a few dozen floating point numbers and on the order of 100 floating point operations per grid point. The computations in all grid points are independent. Specifically, we estimate the efficiency of the RHS operator (SP of NPB) on a single processor as the observed/peak performance ratio. Then we estimate the scalability of the operator on a single chip (2 CPUs), a single MCM (8 CPUs), 16 CPUs, and the whole machine (32 CPUs). Then we perform the same measurements for a cache-optimized version of the RHS operator. For our measurements we use the HPM (Hardware Performance Monitor) counters available on the POWER4. These counters allow us to analyze the obtained performance results.

  4. Message Passing vs. Shared Address Space on a Cluster of SMPs

    NASA Technical Reports Server (NTRS)

    Shan, Hongzhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak

    2000-01-01

    The convergence of scalable computer architectures using clusters of PCs (or PC-SMPs) with commodity networking has become an attractive platform for high-end scientific computing. Currently, message passing and shared address space (SAS) are the two leading programming paradigms for these systems. Message passing has been standardized with MPI, and is the most common and mature programming approach. However, message-passing code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of, and programming effort required for, six applications under both programming models on a 32-CPU PC-SMP cluster. Our application suite consists of codes that typically do not exhibit high efficiency under shared memory programming, due to their high communication-to-computation ratios and complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications; however, on certain classes of problems SAS performance is competitive with MPI. We also present new algorithms for improving the PC cluster performance of MPI collective operations.

  5. Scalability of voltage-controlled filamentary and nanometallic resistance memory devices.

    PubMed

    Lu, Yang; Lee, Jong Ho; Chen, I-Wei

    2017-08-31

    Much effort has been devoted to device and materials engineering to realize nanoscale resistance random access memory (RRAM) for practical applications, but a rational physical basis to be relied on to design scalable devices spanning many length scales is still lacking. In particular, there is no clear criterion for switching control in those RRAM devices in which resistance changes are limited to localized nanoscale filaments that experience concentrated heat, electric current and field. Here, we demonstrate voltage-controlled resistance switching, always at a constant characteristic critical voltage, for macro and nanodevices in both filamentary RRAM and nanometallic RRAM, and the latter switches uniformly and does not require a forming process. As a result, area-scalability can be achieved under a device-area-proportional current compliance for the low resistance state of the filamentary RRAM, and for both the low and high resistance states of the nanometallic RRAM. This finding will help design area-scalable RRAM at the nanoscale. It also establishes an analogy between RRAM and synapses, in which signal transmission is also voltage-controlled.

  6. Scalable problems and memory bounded speedup

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He; Ni, Lionel M.

    1992-01-01

    In this paper three models of parallel speedup are studied: fixed-size speedup, fixed-time speedup, and memory-bounded speedup. The latter two consider the relationship between speedup and problem scalability. Two sets of speedup formulations are derived for these three models. One set considers uneven workload allocation and communication overhead and gives a more accurate estimation. The other set considers a simplified case and provides a clear picture of the impact of the sequential portion of an application on the possible performance gain from parallel processing. The simplified fixed-size speedup is Amdahl's law. The simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as special cases. This study leads to a better understanding of parallel processing.
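
    As a worked illustration of the three simplified models, here is a short sketch; the memory-bounded form follows Sun and Ni's formulation with a workload-growth function G(p), and the parameter values are made up for the example.

    ```python
    # f is the sequential fraction, p the processor count.

    def amdahl(f, p):                 # fixed-size speedup
        return 1.0 / (f + (1.0 - f) / p)

    def gustafson(f, p):              # fixed-time (scaled) speedup
        return f + (1.0 - f) * p

    def memory_bounded(f, p, G):      # Sun-Ni speedup with workload growth G(p)
        return (f + (1.0 - f) * G(p)) / (f + (1.0 - f) * G(p) / p)

    p, f = 32, 0.05
    print(amdahl(f, p))                       # ~12.5
    print(gustafson(f, p))                    # ~30.4
    print(memory_bounded(f, p, lambda p: 1))  # reduces to Amdahl's law
    print(memory_bounded(f, p, lambda p: p))  # reduces to Gustafson's law
    ```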

  7. A Secure and Efficient Scalable Secret Image Sharing Scheme with Flexible Shadow Sizes.

    PubMed

    Xie, Dong; Li, Lixiang; Peng, Haipeng; Yang, Yixian

    2017-01-01

    In a general (k, n) scalable secret image sharing (SSIS) scheme, the secret image is shared by n participants and any k or more participants have the ability to reconstruct it. Scalability means that the amount of information in the reconstructed image scales in proportion to the number of participants. In most existing SSIS schemes, the size of each image shadow is relatively large and the dealer does not have a flexible control strategy to adjust it to meet the demands of different applications. Moreover, almost all existing SSIS schemes are not applicable under noisy conditions. To address these deficiencies, in this paper we present a novel SSIS scheme based on a brand-new technique, called compressed sensing, which has been widely used in many fields such as image processing, wireless communication and medical imaging. Our scheme has the property of flexibility, which means that the dealer can achieve a compromise between the size of each shadow and the quality of the reconstructed image. In addition, our scheme has many other advantages, including smooth scalability, noise-resilient capability, and high security. The experimental results and the comparison with similar works demonstrate the feasibility and superiority of our scheme.
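
    The paper's scheme is built on compressed sensing and is not reproduced here; as background on the threshold property itself, below is a minimal classical Shamir (k, n) sharing sketch over a prime field, showing the reconstruct-from-any-k idea in its simplest form.

    ```python
    import random

    P = 2**127 - 1  # a Mersenne prime, large enough for this toy example

    def make_shares(secret, k, n):
        """Split secret into n shares; any k of them reconstruct it."""
        coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
        def poly(x):
            return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
        return [(x, poly(x)) for x in range(1, n + 1)]

    def reconstruct(shares):
        """Lagrange interpolation at x = 0 recovers the secret."""
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num = den = 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * (-xj) % P
                    den = den * (xi - xj) % P
            secret = (secret + yi * num * pow(den, P - 2, P)) % P
        return secret

    shares = make_shares(123456789, k=3, n=5)
    print(reconstruct(shares[:3]))  # any 3 of the 5 shares suffice
    ```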

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ahn, Gail-Joon

    The project seeks an innovative framework to enable users to access and selectively share resources in distributed environments, enhancing the scalability of information sharing. We have investigated secure sharing & assurance approaches for ad-hoc collaboration, focused on Grids, Clouds, and ad-hoc network environments.

  9. CloudMan as a platform for tool, data, and analysis distribution.

    PubMed

    Afgan, Enis; Chapman, Brad; Taylor, James

    2012-11-27

    Cloud computing provides an infrastructure that facilitates large scale computational analysis in a scalable, democratized fashion. However, in this context it is difficult to ensure sharing of an analysis environment and associated data in a scalable and precisely reproducible way. CloudMan (usecloudman.org) enables individual researchers to easily deploy, customize, and share their entire cloud analysis environment, including data, tools, and configurations. With the enabled customization and sharing of instances, CloudMan can be used as a platform for collaboration. The presented solution improves accessibility of cloud resources, tools, and data to the level of an individual researcher and contributes toward reproducibility and transparency of research solutions.

  10. AgShare Open Knowledge: Improving Rural Communities through University Student Action Research

    ERIC Educational Resources Information Center

    Geith, Christine; Vignare, Karen

    2013-01-01

    The aim of AgShare is to create a scalable and sustainable collaboration of existing organizations for African publishing, localizing, and sharing of science-based teaching and learning materials that fill critical resource gaps in African MSc agriculture curriculum. Shared innovative practices are emerging through the AgShare projects, not only…

  11. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois -Henry

    Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.

  12. An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling

    DOE PAGES

    Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois -Henry; ...

    2016-10-27

    Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.

  13. Overcoming the drawback of lower sense margin in tunnel FET based dynamic memory along with enhanced charge retention and scalability

    NASA Astrophysics Data System (ADS)

    Navlakha, Nupur; Kranti, Abhinav

    2017-11-01

    The work reports on the use of a planar tri-gate tunnel field effect transistor (TFET) to operate as dynamic memory at 85 °C with an enhanced sense margin (SM). Two symmetric gates (G1) aligned to the source at a partial region of the intrinsic film result in better electrostatic control that regulates the read mechanism based on band-to-band tunneling, while the other gate (G2), positioned adjacent to the first front gate, is responsible for charge storage and sustenance. The proposed architecture results in an enhanced SM of ~1.2 μA μm⁻¹ along with a longer retention time (RT) of ~1.8 s at 85 °C, for a total length of 600 nm. The double gate architecture towards the source increases the tunneling current and also reduces short channel effects, enhancing SM and scalability, thereby overcoming the critical bottleneck faced by TFET-based dynamic memories. The work also discusses the impact of overlap/underlap and interface charges on the performance of TFET-based dynamic memory. Insights into device operation demonstrate that the choice of appropriate architecture and biases not only limits the trade-off between SM and RT, but also results in improved scalability, with drain voltage and total length scaled down to 0.8 V and 115 nm, respectively.

  14. Towards robust algorithms for current deposition and dynamic load-balancing in a GPU particle in cell code

    NASA Astrophysics Data System (ADS)

    Rossi, Francesco; Londrillo, Pasquale; Sgattoni, Andrea; Sinigardi, Stefano; Turchetti, Giorgio

    2012-12-01

    We present `jasmine', an implementation of a fully relativistic, 3D, electromagnetic Particle-In-Cell (PIC) code, capable of running simulations in various laser plasma acceleration regimes on Graphics Processing Unit (GPU) HPC clusters. Standard energy/charge-preserving FDTD-based algorithms have been implemented using double precision and quadratic (or arbitrarily sized) shape functions for the particle weighting. When porting a PIC scheme to the GPU architecture (or, in general, a shared memory environment), the particle-to-grid operations (e.g. the evaluation of the current density) require special care to avoid memory inconsistencies and conflicts. Here we present a robust implementation of this operation that is efficient for any number of particles per cell and particle shape function order. Our algorithm exploits the exposed GPU memory hierarchy and avoids the use of atomic operations, which can hurt performance especially when many particles lie in the same cell. We show the code's multi-GPU scalability results and present a dynamic load-balancing algorithm. The code is written using a Python-based C++ meta-programming technique which translates into a high level of modularity and allows for easy performance tuning and simple extension of the core algorithms to various simulation schemes.
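
    A simplified CPU analogue of the conflict-avoidance idea: give each worker a private copy of the grid and reduce the copies afterwards, so no atomic updates are needed. Threads stand in for GPU thread blocks here, and nearest-cell weighting replaces the real shape functions; the paper's actual GPU algorithm is more elaborate.

    ```python
    from concurrent.futures import ThreadPoolExecutor
    import random

    NCELLS, NWORKERS = 16, 4
    particles = [random.uniform(0, NCELLS) for _ in range(10_000)]

    def deposit(chunk):
        grid = [0.0] * NCELLS              # private grid: no write conflicts
        for x in chunk:
            grid[int(x) % NCELLS] += 1.0   # nearest-cell weighting for brevity
        return grid

    chunks = [particles[i::NWORKERS] for i in range(NWORKERS)]
    with ThreadPoolExecutor(NWORKERS) as pool:
        partials = list(pool.map(deposit, chunks))

    density = [sum(col) for col in zip(*partials)]  # reduce the private grids
    print(sum(density))  # equals the particle count
    ```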

  15. JuxtaView - A tool for interactive visualization of large imagery on scalable tiled displays

    USGS Publications Warehouse

    Krishnaprasad, N.K.; Vishwanath, V.; Venkataraman, S.; Rao, A.G.; Renambot, L.; Leigh, J.; Johnson, A.E.; Davis, B.

    2004-01-01

    JuxtaView is a cluster-based application for viewing ultra-high-resolution images on scalable tiled displays. In JuxtaView, we present a new parallel computing and distributed memory approach for out-of-core montage visualization, using LambdaRAM, a software-based network-level cache system. The ultimate goal of JuxtaView is to enable a user to interactively roam through potentially terabytes of distributed, spatially referenced image data such as those from electron microscopes, satellites and aerial photographs. In working towards this goal, we describe our first prototype implemented over a local area network, where the image is distributed using LambdaRAM on the memory of all nodes of a PC cluster driving a tiled display wall. Aggressive pre-fetching schemes employed by LambdaRAM help to reduce the latency involved in remote memory access. We compare LambdaRAM with a more traditional memory-mapped file approach for out-of-core visualization.

  16. Enabling Graph Appliance for Genome Assembly

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Singh, Rina; Graves, Jeffrey A; Lee, Sangkeun

    2015-01-01

    In recent years, there has been huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotides, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray's Urika-GD system; Urika-GD is a high performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real world Ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.
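
    The underlying graph idea, independent of the RDF/SPARQL representation: k-mers become edges between (k-1)-mer nodes, and an Eulerian path spells the assembled sequence. A toy sketch with an invented read:

    ```python
    from collections import defaultdict

    def debruijn_edges(reads, k):
        """Build the de Bruijn adjacency: each k-mer is an edge."""
        adj = defaultdict(list)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                adj[kmer[:-1]].append(kmer[1:])
        return adj

    def eulerian_path(adj, start):
        """Hierholzer's algorithm; consumes edges as it walks."""
        stack, path = [start], []
        while stack:
            v = stack[-1]
            if adj[v]:
                stack.append(adj[v].pop())
            else:
                path.append(stack.pop())
        return path[::-1]

    adj = debruijn_edges(["ACGTAC"], k=3)
    nodes = eulerian_path(adj, "AC")
    print(nodes[0] + "".join(n[-1] for n in nodes[1:]))  # -> ACGTAC
    ```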

  17. A Secure and Efficient Scalable Secret Image Sharing Scheme with Flexible Shadow Sizes

    PubMed Central

    Xie, Dong; Li, Lixiang; Peng, Haipeng; Yang, Yixian

    2017-01-01

    In a general (k, n) scalable secret image sharing (SSIS) scheme, the secret image is shared by n participants and any k or more participants have the ability to reconstruct it. Scalability means that the amount of information in the reconstructed image scales in proportion to the number of participants. In most existing SSIS schemes, the size of each image shadow is relatively large and the dealer does not have a flexible control strategy to adjust it to meet the demands of different applications. Moreover, almost all existing SSIS schemes are not applicable under noisy conditions. To address these deficiencies, in this paper we present a novel SSIS scheme based on a brand-new technique, called compressed sensing, which has been widely used in many fields such as image processing, wireless communication and medical imaging. Our scheme has the property of flexibility, which means that the dealer can achieve a compromise between the size of each shadow and the quality of the reconstructed image. In addition, our scheme has many other advantages, including smooth scalability, noise-resilient capability, and high security. The experimental results and the comparison with similar works demonstrate the feasibility and superiority of our scheme. PMID:28072851

  18. Performances of multiprocessor multidisk architectures for continuous media storage

    NASA Astrophysics Data System (ADS)

    Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

    1996-03-01

    Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400 Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point-to-point architecture is scalable and able to sustain high throughputs for simultaneous compute-bound and data-bound operations.

  19. HYDRA : High-speed simulation architecture for precision spacecraft formation simulation

    NASA Technical Reports Server (NTRS)

    Martin, Bryan J.; Sohl, Garett

    2003-01-01

    Hydra (the Hierarchical Distributed Reconfigurable Architecture) is a scalable simulation architecture that provides flexibility and ease-of-use, taking advantage of modern computation and communication hardware. It also provides the ability to implement distributed- or workstation-based simulations and high-fidelity real-time simulation from a common core. Originally designed to serve as a research platform for examining fundamental challenges in formation flying simulation for future space missions, it is also finding use in other missions and applications, all of which can take advantage of the underlying object-oriented structure to easily produce distributed simulations. Hydra automates the process of connecting disparate simulation components (Hydra Clients) through a client-server architecture that uses high-level descriptions of the data associated with each client to find and forge desirable connections (Hydra Services) at run time. Services communicate through the use of Connectors, which abstract messaging to provide single-interface access to any desired communication protocol, from shared-memory message passing to TCP/IP to ACE and CORBA. Hydra shares many features with the HLA, while providing more flexibility in connectivity services and behavior overriding.

  20. High Performance Computing Multicast

    DTIC Science & Technology

    2012-02-01

    ...responsiveness, first-tier applications often implement replicated in-memory key-value stores, using them to store state or to cache data from services... an alternative that replicates data, combines agreement on update ordering with amnesia freedom, and supports both good scalability and fast response.

  1. CloudMan as a platform for tool, data, and analysis distribution

    PubMed Central

    2012-01-01

    Background Cloud computing provides an infrastructure that facilitates large scale computational analysis in a scalable, democratized fashion. However, in this context it is difficult to ensure sharing of an analysis environment and associated data in a scalable and precisely reproducible way. Results CloudMan (usecloudman.org) enables individual researchers to easily deploy, customize, and share their entire cloud analysis environment, including data, tools, and configurations. Conclusions With the enabled customization and sharing of instances, CloudMan can be used as a platform for collaboration. The presented solution improves accessibility of cloud resources, tools, and data to the level of an individual researcher and contributes toward reproducibility and transparency of research solutions. PMID:23181507

  2. SU-E-T-531: Performance Evaluation of Multithreaded Geant4 for Proton Therapy Dose Calculations in a High Performance Computing Facility

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shin, J; Coss, D; McMurry, J

    Purpose: To evaluate the efficiency of multithreaded Geant4 (Geant4-MT, version 10.0) for proton Monte Carlo dose calculations using a high performance computing facility. Methods: Geant4-MT was used to calculate 3D dose distributions in 1×1×1 mm³ voxels in a water phantom and a patient's head with a 150 MeV proton beam covering approximately 5×5 cm² in the water phantom. Three timestamps were measured on the fly to separately analyze the required time for initialization (which cannot be parallelized), the processing time of individual threads, and the completion time. Scalability of averaged processing time per thread was calculated as a function of thread number (1, 100, 150, and 200) for both 1 M and 50 M histories. The total memory usage was recorded. Results: Simulations with 50 M histories were fastest with 100 threads, taking approximately 1.3 hours and 6 hours for the water phantom and the CT data, respectively, with better than 1.0% statistical uncertainty. The calculations show 1/N scalability in the event loops for both cases. The gains from parallel calculations started to decrease with 150 threads. The memory usage increases linearly with the number of threads. No critical failures were observed during the simulations. Conclusion: Multithreading in Geant4-MT decreased simulation time in proton dose distribution calculations by factors of 64 and 54 at a near optimal 100 threads for the water phantom and the patient data, respectively. Further simulations will be done to determine the efficiency at the optimal thread number. Considering the trend of computer architecture development, utilizing Geant4-MT for radiotherapy simulations is an excellent cost-effective alternative to a distributed batch queuing system. However, because the scalability depends highly on simulation details, i.e., the ratio of the processing time of one event versus the waiting time to access the shared event queue, a performance evaluation as described is recommended.

  3. PhreeqcRM: A reaction module for transport simulators based on the geochemical model PHREEQC

    USGS Publications Warehouse

    Parkhurst, David L.; Wissmeier, Laurin

    2015-01-01

    PhreeqcRM is a geochemical reaction module designed specifically to perform equilibrium and kinetic reaction calculations for reactive transport simulators that use an operator-splitting approach. The basic function of the reaction module is to take component concentrations from the model cells of the transport simulator, run geochemical reactions, and return updated component concentrations to the transport simulator. If multicomponent diffusion is modeled (e.g., Nernst–Planck equation), then aqueous species concentrations can be used instead of component concentrations. The reaction capabilities are a complete implementation of the reaction capabilities of PHREEQC. In each cell, the reaction module maintains the composition of all of the reactants, which may include minerals, exchangers, surface complexers, gas phases, solid solutions, and user-defined kinetic reactants. PhreeqcRM assigns initial and boundary conditions for model cells based on standard PHREEQC input definitions (files or strings) of chemical compositions of solutions and reactants. Additional PhreeqcRM capabilities include methods to eliminate reaction calculations for inactive parts of a model domain, transfer concentrations and other model properties, and retrieve selected results. The module demonstrates good scalability for parallel processing by using multiprocessing with MPI (message passing interface) on distributed memory systems, and limited scalability using multithreading with OpenMP on shared memory systems. PhreeqcRM is written in C++, but interfaces allow methods to be called from C or Fortran. By using the PhreeqcRM reaction module, an existing multicomponent transport simulator can be extended to simulate a wide range of geochemical reactions. Results of the implementation of PhreeqcRM as the reaction engine for transport simulators PHAST and FEFLOW are shown by using an analytical solution and the reactive transport benchmark of MoMaS.
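
    A schematic of the operator-splitting loop such a reaction module supports: the transport code advances advection, hands per-cell concentrations to the module, and takes updated values back. The class and method names below are hypothetical stand-ins, not the actual PhreeqcRM interface, and first-order decay stands in for the real chemistry.

    ```python
    import math

    class ReactionModule:
        """Hypothetical stand-in for a per-cell reaction engine."""
        def __init__(self, rate):
            self.rate = rate
        def react(self, conc, dt):
            # first-order decay as a placeholder for PHREEQC's chemistry
            return [c * math.exp(-self.rate * dt) for c in conc]

    def advect(conc, inflow):
        return [inflow] + conc[:-1]        # upwind shift, CFL = 1 for brevity

    cells = [0.0] * 10
    module = ReactionModule(rate=0.1)
    for step in range(20):
        cells = advect(cells, inflow=1.0)    # transport half of the split
        cells = module.react(cells, dt=1.0)  # reaction half of the split
    print(["%.3f" % c for c in cells])
    ```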

  4. Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Heber, Gerd; Biswas, Rupak

    2000-01-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
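
    For reference, the kernel under study in its simplest serial form: a conjugate gradient iteration built on a CSR sparse matrix-vector multiply, with the ordering and partitioning questions set aside. The tridiagonal test system is invented for the example.

    ```python
    def spmv(values, col_idx, row_ptr, x):
        """CSR sparse matrix-vector multiply: y = A x."""
        y = []
        for i in range(len(row_ptr) - 1):
            y.append(sum(values[k] * x[col_idx[k]]
                         for k in range(row_ptr[i], row_ptr[i + 1])))
        return y

    def cg(values, col_idx, row_ptr, b, tol=1e-10):
        n = len(b)
        x, r = [0.0] * n, b[:]
        p, rs = r[:], sum(v * v for v in r)
        for _ in range(n):
            Ap = spmv(values, col_idx, row_ptr, p)
            alpha = rs / sum(p[i] * Ap[i] for i in range(n))
            x = [x[i] + alpha * p[i] for i in range(n)]
            r = [r[i] - alpha * Ap[i] for i in range(n)]
            rs_new = sum(v * v for v in r)
            if rs_new < tol:
                break
            p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
            rs = rs_new
        return x

    # SPD tridiagonal system: 2 on the diagonal, -1 off it
    values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
    col_idx = [0, 1, 0, 1, 2, 1, 2]
    row_ptr = [0, 2, 5, 7]
    print(cg(values, col_idx, row_ptr, [1.0, 0.0, 1.0]))  # -> ~[1.0, 1.0, 1.0]
    ```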

  5. Performing an allreduce operation using shared memory

    DOEpatents

    Archer, Charles J [Rochester, MN; Dozsa, Gabor [Ardsley, NY; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

    2012-04-17

    Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
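
    A rough user-space analogue of the mechanism the patent describes: the reduction is split into work units over a shared buffer, and each available worker repeatedly claims the next unit from a shared counter. A real allreduce also broadcasts the result back to every core, which this sketch omits.

    ```python
    import multiprocessing as mp

    def worker(data, partial, next_unit, unit):
        while True:
            with next_unit.get_lock():
                u = next_unit.value     # claim the next work unit
                next_unit.value += 1
            start = u * unit
            if start >= len(data):
                return                  # no work units left
            s = sum(data[start:start + unit])   # reduce one shared-memory chunk
            with partial.get_lock():
                partial.value += s

    if __name__ == "__main__":
        data = mp.Array("d", [1.0] * 1000, lock=False)  # shared input buffer
        partial = mp.Value("d", 0.0)                    # shared reduction result
        next_unit = mp.Value("i", 0)                    # shared work-unit counter
        procs = [mp.Process(target=worker, args=(data, partial, next_unit, 100))
                 for _ in range(4)]
        for p in procs: p.start()
        for p in procs: p.join()
        print(partial.value)  # 1000.0
    ```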

  6. Performing an allreduce operation using shared memory

    DOEpatents

    Archer, Charles J; Dozsa, Gabor; Ratterman, Joseph D; Smith, Brian E

    2014-06-10

    Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.

  7. Hierarchical resilience with lightweight threads.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wheeler, Kyle Bruce

    2011-10-01

    This paper proposes methodology for providing robustness and resilience for a highly threaded distributed- and shared-memory environment based on well-defined inputs and outputs to lightweight tasks. These inputs and outputs form a failure 'barrier', allowing tasks to be restarted or duplicated as necessary. These barriers must be expanded based on task behavior, such as communication between tasks, but do not prohibit any given behavior. One trend in high-performance computing codes is toward self-contained functions that mimic functional programming. Software designers are trending toward a model of software design where their core functions are specified in side-effect free or low-side-effect ways, wherein the inputs and outputs of the functions are well-defined. This provides the ability to copy the inputs to wherever they need to be (whether that is the other side of the PCI bus or the other side of the network), do work on that input using local memory, and then copy the outputs back (as needed). This design pattern is popular among new distributed threading environment designs. Such designs include the Barcelona STARS system, distributed OpenMP systems, the Habanero-C and Habanero-Java systems from Vivek Sarkar at Rice University, the HPX/ParalleX model from LSU, as well as our own Scalable Parallel Runtime effort (SPR) and the Trilinos stateless kernels. This design pattern is also shared by CUDA and several OpenMP extensions for GPU-type accelerators (e.g. the PGI OpenMP extensions).

  8. Scalable, High-performance 3D Imaging Software Platform: System Architecture and Application to Virtual Colonoscopy

    PubMed Central

    Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli; Brett, Bevin

    2013-01-01

    One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time that is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth, due to the massive amount of data that need to be processed. In this work, we have developed a software platform that is designed to support high-performance 3D medical image processing for a wide range of applications using increasingly available and affordable commodity computing systems: multi-core, clusters, and cloud computing systems. To achieve scalable, high-performance computing, our platform (1) employs size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D image processing algorithms; (2) supports task scheduling for efficient load distribution and balancing; and (3) consists of layered parallel software libraries that allow a wide range of medical applications to share the same functionalities. We evaluated the performance of our platform by applying it to an electronic cleansing system in virtual colonoscopy, with initial experimental results showing a 10 times performance improvement on an 8-core workstation over the original sequential implementation of the system. PMID:23366803

  9. OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

    NASA Astrophysics Data System (ADS)

    Kimura, Keiji; Mase, Masayoshi; Mikami, Hiroki; Miyamoto, Takamichi; Shirako, Jun; Kasahara, Hironori

    OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in a METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restriction for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with the OSCAR API can be compiled by ordinary OpenMP compilers, since the OSCAR API is designed as a subset of OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and a newly developed consumer electronics multicore chip RP2 by Renesas, Hitachi and Waseda. From the results of the scalability evaluation, it is found that, on average, the OSCAR compiler with the OSCAR API can exploit 5.8 times speedup over sequential execution on the Power5+ workstation with eight cores and 2.9 times speedup on RP2 with four cores, respectively. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shamis, Pavel; Graham, Richard L; Gorentla Venkata, Manjunath

    The scalability and performance of collective communication operations limit the scalability and performance of many scientific applications. This paper presents two new blocking and nonblocking Broadcast algorithms for communicators with arbitrary communication topology, and studies their performance. These algorithms benefit from increased concurrency and a reduced memory footprint, making them suitable for use on large-scale systems. Measuring small, medium, and large data Broadcasts on a Cray-XT5, using 24,576 MPI processes, the Cheetah algorithms outperform the native MPI on that system by 51%, 69%, and 9%, respectively, at the same process count. These results demonstrate an algorithmic approach to the implementation of the important class of collective communications, which is high performing, scalable, and also uses resources in a scalable manner.
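
    The Cheetah algorithms themselves are topology-aware and not reproduced here; as a simpler reference point, the sketch below computes the send schedule of a classic binomial-tree broadcast, which reaches n processes in ceil(log2 n) rounds.

    ```python
    def binomial_broadcast_schedule(n, root=0):
        """Return, per round, the (sender, receiver) pairs of a
        binomial-tree broadcast among ranks 0..n-1 rooted at `root`."""
        rounds, dist = [], 1
        while dist < n:
            sends = []
            for src in range(dist):
                dst = src + dist
                if dst < n:
                    sends.append(((src + root) % n, (dst + root) % n))
            rounds.append(sends)
            dist *= 2
        return rounds

    for r, sends in enumerate(binomial_broadcast_schedule(12)):
        print("round", r, sends)
    # round 0: 0->1; round 1: 0->2, 1->3; round 2: 0->4 ... 3->7; round 3: 0->8 ...
    ```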

  11. Memory-Scalable GPU Spatial Hierarchy Construction.

    PubMed

    Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D

    2011-04-01

    Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees that automatically balances between the level of parallelism and intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms.

  12. Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures

    DTIC Science & Technology

    2017-10-04

    This report covers the development of efficient algorithms for scientific and geometric computing that exploit the power and performance efficiency of heterogeneous shared memory architectures.

  13. Network selection, Information filtering and Scalable computation

    NASA Astrophysics Data System (ADS)

    Ye, Changqing

    This dissertation explores two application scenarios of the sparsity pursuit method on large scale data sets. The first scenario is classification and regression in analyzing high dimensional structured data, where predictors correspond to nodes of a given directed graph. This arises in, for instance, identification of disease genes for Parkinson's disease from a network of candidate genes. In such a situation, the directed graph describes dependencies among the genes, where directions of edges represent certain causal effects. Key to high-dimensional structured classification and regression is how to utilize dependencies among predictors as specified by directions of the graph. In this dissertation, we develop a novel method that fully takes into account such dependencies formulated through certain nonlinear constraints. We apply the proposed method to two applications: feature selection in large margin binary classification and in linear regression. We implement the proposed method through difference convex programming for the cost function and constraints. Finally, theoretical and numerical analyses suggest that the proposed method achieves the desired objectives. An application to disease gene identification is presented. The second application scenario is personalized information filtering, which extracts the information specifically relevant to a user, predicting his/her preference over a large number of items based on the opinions of users who think alike or on item content. This problem is cast into the framework of regression and classification, where we introduce novel partial latent models to integrate additional user-specific and content-specific predictors for higher predictive accuracy. In particular, we factorize a user-over-item preference matrix into a product of two matrices, each representing a user's preference and an item preference by users. Then we propose a likelihood method to seek a sparsest latent factorization, from a class of over-complete factorizations, possibly with a high percentage of missing values. This promotes additional sparsity beyond rank reduction. Computationally, we design methods based on a "decomposition and combination" strategy, to break large-scale optimization into many small subproblems to solve in a recursive and parallel manner. On this basis, we implement the proposed methods through multi-platform shared-memory parallel programming, and through Mahout, a library for scalable machine learning and data mining, for MapReduce computation. For example, our methods are scalable to a dataset consisting of three billion observations on a single machine with sufficient memory, with good timings. Both theoretical and numerical investigations show that the proposed methods exhibit significant improvement in accuracy over state-of-the-art scalable methods.
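
    A bare-bones version of the latent factorization at the heart of the second scenario: stochastic gradient descent on the observed entries of a user-by-item matrix, with L2 regularization. Data and hyperparameters are invented for the example; the dissertation's models add sparsity pursuit and side predictors on top of this.

    ```python
    import random

    random.seed(0)
    ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0, (2, 2): 2.0}
    n_users, n_items, rank, lam, lr = 3, 3, 2, 0.05, 0.02

    U = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]

    for epoch in range(2000):
        for (u, i), r in ratings.items():
            pred = sum(U[u][k] * V[i][k] for k in range(rank))
            err = r - pred
            for k in range(rank):   # gradient step with L2 regularization
                U[u][k] += lr * (err * V[i][k] - lam * U[u][k])
                V[i][k] += lr * (err * U[u][k] - lam * V[i][k])

    # predict a missing entry, e.g. user 1's preference for item 2
    print(sum(U[1][k] * V[2][k] for k in range(rank)))
    ```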

  14. The effect of the order in which episodic autobiographical memories versus autobiographical knowledge are shared on feelings of closeness.

    PubMed

    Brandon, Nicole R; Beike, Denise R; Cole, Holly E

    2017-07-01

    Autobiographical memories (AMs) can be used to create and maintain closeness with others [Alea, N., & Bluck, S. (2003). Why are you telling me that? A conceptual model of the social function of autobiographical memory. Memory, 11(2), 165-178]. However, the differential effects of memory specificity are not well established. Two studies with 148 participants tested whether the order in which autobiographical knowledge (AK) and specific episodic AM (EAM) are shared affects feelings of closeness. Participants read two memories hypothetically shared by each of four strangers. The strangers first shared either AK or an EAM, and then shared either AK or an EAM. Participants were randomly assigned to read either positive or negative AMs from the strangers. Findings suggest that people feel closer to those who share positive AMs in the same way they construct memories: starting with general and moving to specific.

  15. Adaptive packet switch with an optical core (demonstrator)

    NASA Astrophysics Data System (ADS)

    Abdo, Ahmad; Bishtein, Vadim; Clark, Stewart A.; Dicorato, Pino; Lu, David T.; Paredes, Sofia A.; Taebi, Sareh; Hall, Trevor J.

    2004-11-01

    A three-stage opto-electronic packet switch architecture is described, consisting of a reconfigurable optical centre stage surrounded by two electronic buffering stages partitioned into sectors to ease memory contention. A Flexible Bandwidth Provision (FBP) algorithm, implemented on a soft-core processor, is used to change the configuration of the input sectors and optical centre stage to set up internal paths that will provide variable bandwidth to serve the traffic. The switch is modeled by a bipartite graph built from a service matrix, which is a function of the arriving traffic. The bipartite graph is decomposed by solving an edge-colouring problem and the resulting permutations are used to configure the switch. Simulation results show that this architecture exhibits a dramatic reduction of complexity and increased potential for scalability, at the price of only a modest spatial speed-up k.

  16. Automation of Data Traffic Control on DSM Architecture

    NASA Technical Reports Server (NTRS)

    Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry

    2001-01-01

    The design of distributed shared memory (DSM) computers liberates users from the duty to distribute data across processors and allows for the incremental development of parallel programs using, for example, OpenMP or Java threads. DSM architecture greatly simplifies the development of parallel programs having good performance on a few processors. However, to achieve a good program scalability on DSM computers requires that the user understand data flow in the application and use various techniques to avoid data traffic congestions. In this paper we discuss a number of such techniques, including data blocking, data placement, data transposition and page size control and evaluate their efficiency on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks. We also present a tool which automates the detection of constructs causing data congestions in Fortran array oriented codes and advises the user on code transformations for improving data traffic in the application.
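
    A small illustration of the data-blocking technique the paper evaluates: transposing in tiles keeps reads and writes within cache- and page-sized chunks instead of striding across the whole array. The sizes are arbitrary example values.

    ```python
    def blocked_transpose(a, n, block=64):
        """Transpose the n x n matrix a tile by tile."""
        t = [[0.0] * n for _ in range(n)]
        for ii in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        t[j][i] = a[i][j]   # accesses stay within one tile
        return t

    n = 256
    a = [[float(i * n + j) for j in range(n)] for i in range(n)]
    t = blocked_transpose(a, n)
    assert all(t[j][i] == a[i][j] for i in range(0, n, 17) for j in range(0, n, 19))
    ```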

  17. Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver

    NASA Technical Reports Server (NTRS)

    Mavriplis, Dimitri J.

    1998-01-01

    A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.

  18. Towards Scalable Graph Computation on Mobile Devices.

    PubMed

    Chen, Yiqi; Lin, Zhiyuan; Pienta, Robert; Kahng, Minsuk; Chau, Duen Horng

    2014-10-01

    Mobile devices have become increasingly central to our everyday activities, due to their portability, multi-touch capabilities, and ever-improving computational power. Such attractive features have spurred research interest in leveraging mobile devices for computation. We explore a novel approach that aims to use a single mobile device to perform scalable graph computation on large graphs that do not fit in the device's limited main memory, opening up the possibility of performing on-device analysis of large datasets, without relying on the cloud. Based on the familiar memory mapping capability provided by today's mobile operating systems, our approach to scale up computation is powerful and intentionally kept simple to maximize its applicability across the iOS and Android platforms. Our experiments demonstrate that an iPad mini can perform fast computation on large real graphs with as many as 272 million edges (Google+ social graph), at a speed that is only a few times slower than a 13″ Macbook Pro. Through creating a real world iOS app with this technique, we demonstrate the strong potential application for scalable graph computation on a single mobile device using our approach.
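
    A minimal sketch of the memory-mapping approach, assuming a binary int32 edge-list file: the OS pages edges in on demand, so a sequential scan works even when the graph exceeds RAM. The file layout and tiny graph are invented for the example.

    ```python
    import mmap, os, struct, tempfile

    # Write a toy edge list as pairs of little-endian int32s.
    edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
    path = os.path.join(tempfile.mkdtemp(), "edges.bin")
    with open(path, "wb") as f:
        for u, v in edges:
            f.write(struct.pack("<ii", u, v))

    # Memory-map the file and scan it edge by edge.
    degree = {}
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for off in range(0, len(mm), 8):          # one edge = two int32s
                u, v = struct.unpack_from("<ii", mm, off)
                degree[u] = degree.get(u, 0) + 1      # out-degree count
    print(degree)  # {0: 2, 1: 1, 2: 1}
    ```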

  19. Scalable and expressive medical terminologies.

    PubMed

    Mays, E; Weida, R; Dionne, R; Laker, M; White, B; Liang, C; Oles, F J

    1996-01-01

    The K-Rep system, based on description logic, is used to represent and reason with large and expressive controlled medical terminologies. Expressive concept descriptions incorporate semantically precise definitions composed using logical operators, together with important non-semantic information such as synonyms and codes. Examples are drawn from our experience with K-Rep in modeling the InterMed laboratory terminology and also developing a large clinical terminology now in production use at Kaiser-Permanente. System-level scalability of performance is achieved through an object-oriented database system which efficiently maps persistent memory to virtual memory. Equally important is conceptual scalability: the ability to support collaborative development, organization, and visualization of a substantial terminology as it evolves over time. K-Rep addresses this need by logically completing concept definitions and automatically classifying concepts in a taxonomy via subsumption inferences. The K-Rep system includes a general-purpose GUI environment for terminology development and browsing, a custom interface for formulary term maintenance, a C++ application program interface, and a distributed client-server mode which provides lightweight clients with efficient run-time access to K-Rep by means of a scripting language.

  20. Programming time-multiplexed reconfigurable hardware using a scalable neuromorphic compiler.

    PubMed

    Minkovich, Kirill; Srinivasa, Narayan; Cruz-Albrecht, Jose M; Cho, Youngkwan; Nogin, Aleksey

    2012-06-01

    Scalability and connectivity are two key challenges in designing neuromorphic hardware that can match biological levels. In this paper, we describe a neuromorphic system architecture design that addresses an approach to meet these challenges using traditional complementary metal-oxide-semiconductor (CMOS) hardware. A key requirement in realizing such neural architectures in hardware is the ability to automatically configure the hardware to emulate any neural architecture or model. The focus for this paper is to describe the details of such a programmable front-end. This programmable front-end is composed of a neuromorphic compiler and a digital memory, and is designed based on the concept of synaptic time-multiplexing (STM). The neuromorphic compiler automatically translates any given neural architecture to hardware switch states and these states are stored in digital memory to enable desired neural architectures. STM enables our proposed architecture to address scalability and connectivity using traditional CMOS hardware. We describe the details of the proposed design and the programmable front-end, and provide examples to illustrate its capabilities. We also provide perspectives for future extensions and potential applications.

  1. Towards Scalable Graph Computation on Mobile Devices

    PubMed Central

    Chen, Yiqi; Lin, Zhiyuan; Pienta, Robert; Kahng, Minsuk; Chau, Duen Horng

    2015-01-01

    Mobile devices have become increasingly central to our everyday activities, due to their portability, multi-touch capabilities, and ever-improving computational power. Such attractive features have spurred research interest in leveraging mobile devices for computation. We explore a novel approach that aims to use a single mobile device to perform scalable graph computation on large graphs that do not fit in the device's limited main memory, opening up the possibility of performing on-device analysis of large datasets, without relying on the cloud. Based on the familiar memory mapping capability provided by today's mobile operating systems, our approach to scale up computation is powerful and intentionally kept simple to maximize its applicability across the iOS and Android platforms. Our experiments demonstrate that an iPad mini can perform fast computation on large real graphs with as many as 272 million edges (Google+ social graph), at a speed that is only a few times slower than a 13″ Macbook Pro. Through creating a real world iOS app with this technique, we demonstrate the strong potential application for scalable graph computation on a single mobile device using our approach. PMID:25859564

  2. Shared Semantics and the Use of Organizational Memories for E-Mail Communications.

    ERIC Educational Resources Information Center

    Schwartz, David G.

    1998-01-01

    Examines the use of shared semantics information to link concepts in an organizational memory to e-mail communications. Presents a framework for determining shared semantics based on organizational and personal user profiles. Illustrates how shared semantics are used by the HyperMail system to help link organizational memories (OM) content to…

  3. GoFFish: Graph-Oriented Framework for Foresight and Insight Using Scalable Heuristics

    DTIC Science & Technology

    2015-09-01

    ...series of research challenges that arise from the above, related to scalability, data partitioning, memory representation and storage, and execution... a job with varied deadlines in Jan 2013; further, we also build an SPSP model using Jan 2013 data post facto as an ideal case.

  4. 'I Can't Concentrate': A Feasibility Study with Young Refugees in Sweden on Developing Science-Driven Interventions for Intrusive Memories Related to Trauma.

    PubMed

    Holmes, Emily A; Ghaderi, Ata; Eriksson, Ellinor; Lauri, Klara Olofsdotter; Kukacka, Olivia M; Mamish, Maya; James, Ella L; Visser, Renée M

    2017-03-01

    The number of refugees is the highest ever worldwide. Many have experienced trauma in their home countries or during their escape, with mental health sequelae. Intrusive memories comprise distressing scenes of trauma which spring to mind unbidden. Development of novel scalable psychological interventions is needed urgently. We propose that brief cognitive science-driven interventions should be developed which pinpoint a focal symptom alongside a means to monitor it using behavioural techniques. The aim of the current study was to assess the feasibility and acceptability of the methodology required to develop such an intervention. In this study we recruited 22 refugees (16-25 years), predominantly from Syria and residing in Sweden. Participants were asked to monitor the frequency of intrusive memories of trauma using a daily diary; rate intrusions and concentration; and complete a 1-session behavioural intervention involving Tetris game-play via smartphone. Frequency of intrusive memories was high, and associated with high levels of distress and impaired concentration. Levels of engagement with study procedures were highly promising. The current work opens the way for developing novel cognitive behavioural approaches for traumatized refugees that are mechanistically derived, freely available and internationally scalable.

  5. Spindle

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2013-04-04

    Spindle is software infrastructure that solves file system scalability problems associated with starting dynamically linked applications in HPC environments. When an HPC application starts up thousands of processes at once, and those processes simultaneously access a shared file system to look for shared libraries, it can cause significant performance problems for both the application and other users. Spindle scalably coordinates the distribution of shared libraries to an application to avoid hammering the shared file system.

  6. Nanophotonic rare-earth quantum memory with optically controlled retrieval

    NASA Astrophysics Data System (ADS)

    Zhong, Tian; Kindem, Jonathan M.; Bartholomew, John G.; Rochman, Jake; Craiciu, Ioana; Miyazono, Evan; Bettinelli, Marco; Cavalli, Enrico; Verma, Varun; Nam, Sae Woo; Marsili, Francesco; Shaw, Matthew D.; Beyer, Andrew D.; Faraon, Andrei

    2017-09-01

    Optical quantum memories are essential elements in quantum networks for long-distance distribution of quantum entanglement. Scalable development of quantum network nodes requires on-chip qubit storage functionality with control of the readout time. We demonstrate a high-fidelity nanophotonic quantum memory based on a mesoscopic neodymium ensemble coupled to a photonic crystal cavity. The nanocavity enables >95% spin polarization for efficient initialization of the atomic frequency comb memory and time bin-selective readout through an enhanced optical Stark shift of the comb frequencies. Our solid-state memory is integrable with other chip-scale photon source and detector devices for multiplexed quantum and classical information processing at the network nodes.

  7. Continuous-variable quantum computing in optical time-frequency modes using quantum memories.

    PubMed

    Humphreys, Peter C; Kolthammer, W Steven; Nunn, Joshua; Barbieri, Marco; Datta, Animesh; Walmsley, Ian A

    2014-09-26

    We develop a scheme for time-frequency encoded continuous-variable cluster-state quantum computing using quantum memories. In particular, we propose a method to produce, manipulate, and measure two-dimensional cluster states in a single spatial mode by exploiting the intrinsic time-frequency selectivity of Raman quantum memories. Time-frequency encoding enables the scheme to be extremely compact, requiring a number of memories that are a linear function of only the number of different frequencies in which the computational state is encoded, independent of its temporal duration. We therefore show that quantum memories can be a powerful component for scalable photonic quantum information processing architectures.

  8. Technical support for digital systems technology development. Task order 1: ISP contention analysis and control

    NASA Technical Reports Server (NTRS)

    Stehle, Roy H.; Ogier, Richard G.

    1993-01-01

    Alternatives for realizing a packet-based network switch for use on a frequency division multiple access/time division multiplexed (FDMA/TDM) geostationary communication satellite were investigated. Each of the eight downlink beams supports eight directed dwells. The design needed to accommodate multicast packets with very low probability of loss due to contention. Three switch architectures were designed and analyzed. An output-queued, shared bus system yielded a functionally simple system, utilizing a first-in, first-out (FIFO) memory per downlink dwell, but at the expense of a large total memory requirement. A shared memory architecture offered the most efficiency in memory requirements, requiring about half the memory of the shared bus design. The processing requirement for the shared-memory system adds system complexity that may offset the benefits of the smaller memory. An alternative design using a shared memory buffer per downlink beam decreases circuit complexity through a distributed design, and requires at most 1000 packets of memory more than the completely shared memory design. Modifications to the basic packet switch designs were proposed to accommodate circuit-switched traffic, which must be served on a periodic basis with minimal delay. Methods for dynamically controlling the downlink dwell lengths were developed and analyzed. These methods adapt quickly to changing traffic demands, and do not add significant complexity or cost to the satellite and ground station designs. Methods for reducing the memory requirement by not requiring the satellite to store full packets were also proposed and analyzed. In addition, optimal packet and dwell lengths were computed as functions of memory size for the three switch architectures.
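
    To make the output-queued design concrete, the sketch below implements the fixed-capacity FIFO ring buffer that would back each downlink dwell; a full queue models packet loss due to contention. This is an illustrative reconstruction in C, not code from the report, and the dwell count, queue capacity, and packet size are assumptions. In such a design a multicast packet is simply enqueued on every destination dwell's FIFO, which helps explain the large total memory requirement noted above.

        /* Illustrative output queuing: one fixed-capacity FIFO ring buffer
           per downlink dwell. Sizes and the Packet type are assumptions. */
        #include <stdbool.h>

        #define DWELLS   64        /* 8 beams x 8 dwells per beam */
        #define CAPACITY 1024      /* packets buffered per dwell  */

        typedef struct { unsigned char payload[53]; } Packet;

        typedef struct {
            Packet slot[CAPACITY];
            int head, tail, count;
        } Fifo;

        static Fifo queues[DWELLS];

        /* Enqueue on arrival; a full queue means loss due to contention. */
        bool fifo_put(Fifo *q, const Packet *p) {
            if (q->count == CAPACITY) return false;
            q->slot[q->tail] = *p;
            q->tail = (q->tail + 1) % CAPACITY;
            q->count++;
            return true;
        }

        /* Dequeue when the dwell is scheduled on the downlink. */
        bool fifo_get(Fifo *q, Packet *out) {
            if (q->count == 0) return false;
            *out = q->slot[q->head];
            q->head = (q->head + 1) % CAPACITY;
            q->count--;
            return true;
        }

        int main(void) {
            Packet p = { {0} }, out;
            fifo_put(&queues[0], &p);
            return fifo_get(&queues[0], &out) ? 0 : 1;
        }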

  9. CXRO - The Center for X-ray Optics

    Science.gov Websites

    CXRO develops advanced experimental systems to address national needs, supports research in the materials and life sciences, and provides a stepping stone for realizing stable and highly scalable (10 nm and below) non-volatile memory.

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Edwards, Harold C.; Ibanez, Daniel Alejandro

    This report documents the ASC/ATDM Kokkos deliverable "Production Portable Dynamic Task DAG Capability." This capability enables applications to create and execute a dynamic task DAG: a collection of heterogeneous computational tasks with a directed acyclic graph (DAG) of "execute after" dependencies, where tasks and their dependencies are dynamically created and destroyed as tasks execute. The Kokkos task scheduler executes the dynamic task DAG on the target execution resource, e.g. a multicore CPU, a manycore CPU such as Intel's Knights Landing (KNL), or an NVIDIA GPU. Several major technical challenges had to be addressed during development of Kokkos' task DAG capability: (1) portability to a GPU with its simplified hardware and micro-runtime, (2) thread-scalable memory allocation and deallocation from a bounded pool of memory, (3) a thread-scalable scheduler for the dynamic task DAG, and (4) usability by applications. A conceptual sketch of the "execute after" mechanism follows.
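
    As a conceptual illustration of the "execute after" mechanism (and not the Kokkos API), the C sketch below tracks for each task a count of unfinished predecessors; a task enters the ready queue when that count reaches zero, and a completing task decrements the counts of its successors. The thread-scalable pool allocation and scheduling challenges listed above arise when exactly this bookkeeping must run concurrently on a GPU.

        /* Conceptual task-DAG sketch (not the Kokkos API): dependency
           counting with a ready queue, serial for clarity. */
        #include <stdio.h>

        #define MAXT    16
        #define MAXSUCC  4

        typedef struct Task {
            void (*run)(struct Task *);
            int unfinished_preds;          /* "execute after" counter */
            struct Task *succ[MAXSUCC];    /* tasks that wait on this */
            int nsucc;
        } Task;

        static Task *ready[MAXT];
        static int nready;

        static void add_dep(Task *before, Task *after) {
            before->succ[before->nsucc++] = after;
            after->unfinished_preds++;
        }

        static void schedule(Task *tasks, int n) {
            for (int i = 0; i < n; ++i)
                if (tasks[i].unfinished_preds == 0)
                    ready[nready++] = &tasks[i];
            while (nready > 0) {
                Task *t = ready[--nready];
                t->run(t);
                for (int i = 0; i < t->nsucc; ++i)   /* release successors */
                    if (--t->succ[i]->unfinished_preds == 0)
                        ready[nready++] = t->succ[i];
            }
        }

        static void hello(Task *t) { printf("task %p done\n", (void *)t); }

        int main(void) {
            Task t[3] = { { hello }, { hello }, { hello } };
            add_dep(&t[0], &t[2]);   /* t[2] executes after t[0] */
            add_dep(&t[1], &t[2]);   /* ...and after t[1]        */
            schedule(t, 3);
            return 0;
        }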

  11. Hi-Corrector: a fast, scalable and memory-efficient package for normalizing large-scale Hi-C data.

    PubMed

    Li, Wenyuan; Gong, Ke; Li, Qingjiao; Alber, Frank; Zhou, Xianghong Jasmine

    2015-03-15

    Genome-wide proximity ligation assays, e.g. Hi-C and its variant TCC, have recently become important tools to study spatial genome organization. Removing biases from chromatin contact matrices generated by such techniques is a critical preprocessing step of subsequent analyses. The continuing decline of sequencing costs has led to an ever-improving resolution of Hi-C data, resulting in very large matrices of chromatin contacts. Such large matrices, however, pose a great challenge to the memory usage and speed of normalization. Therefore, there is an urgent need for fast and memory-efficient methods for normalization of Hi-C data. We developed Hi-Corrector, an easy-to-use, open source implementation of the Hi-C data normalization algorithm. Its salient features are (i) scalability: the software is capable of normalizing Hi-C data of any size in reasonable time; (ii) memory efficiency: the sequential version can run on any single computer with very limited memory, no matter how little; (iii) fast speed: the parallel version can run very fast on multiple computing nodes with limited local memory. The sequential version is implemented in ANSI C and can be easily compiled on any system; the parallel version is implemented in ANSI C with the MPI library (a standardized and portable parallel environment designed for solving large-scale scientific problems). The package is freely available at http://zhoulab.usc.edu/Hi-Corrector/.
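
    The normalization itself is a matrix-balancing iteration. The sketch below shows one simplified pass in C with MPI, assuming the rows of a dense contact matrix are split evenly across ranks: each rank computes its local row sums, the full bias vector is assembled with MPI_Allgather, and every local entry is divided by the product of its row and column biases. The actual Hi-Corrector implementation differs in detail.

        /* One simplified iterative-correction pass; rows block-partitioned
           across ranks, n divisible by the number of ranks (assumed). */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        void ic_pass(double *rows, int nlocal, int n, MPI_Comm comm) {
            double *bias  = malloc(n * sizeof(double));
            double *local = malloc(nlocal * sizeof(double));

            /* Row sums for the rows this rank owns. */
            for (int i = 0; i < nlocal; ++i) {
                double s = 0.0;
                for (int j = 0; j < n; ++j) s += rows[i * (size_t)n + j];
                local[i] = (s > 0.0) ? s : 1.0;    /* guard empty rows */
            }

            /* Assemble the full bias vector on every rank. */
            MPI_Allgather(local, nlocal, MPI_DOUBLE,
                          bias, nlocal, MPI_DOUBLE, comm);

            int rank;
            MPI_Comm_rank(comm, &rank);
            int row0 = rank * nlocal;

            /* Scale each entry by its row and column biases. */
            for (int i = 0; i < nlocal; ++i)
                for (int j = 0; j < n; ++j)
                    rows[i * (size_t)n + j] /= bias[row0 + i] * bias[j];

            free(local);
            free(bias);
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, np;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &np);
            int nlocal = 4, n = nlocal * np;
            double *rows = malloc(nlocal * (size_t)n * sizeof(double));
            for (int i = 0; i < nlocal * n; ++i) rows[i] = 1.0;
            ic_pass(rows, nlocal, n, MPI_COMM_WORLD);
            if (rank == 0) printf("corrected entry: %g\n", rows[0]);
            free(rows);
            MPI_Finalize();
            return 0;
        }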

  12. Carbon nanomaterials for non-volatile memories

    NASA Astrophysics Data System (ADS)

    Ahn, Ethan C.; Wong, H.-S. Philip; Pop, Eric

    2018-03-01

    Carbon can create various low-dimensional nanostructures with remarkable electronic, optical, mechanical and thermal properties. These features make carbon nanomaterials especially interesting for next-generation memory and storage devices, such as resistive random access memory, phase-change memory, spin-transfer-torque magnetic random access memory and ferroelectric random access memory. Non-volatile memories greatly benefit from the use of carbon nanomaterials in terms of bit density and energy efficiency. In this Review, we discuss sp2-hybridized carbon-based low-dimensional nanostructures, such as fullerene, carbon nanotubes and graphene, in the context of non-volatile memory devices and architectures. Applications of carbon nanomaterials as memory electrodes, interfacial engineering layers, resistive-switching media, and scalable, high-performance memory selectors are investigated. Finally, we compare the different memory technologies in terms of writing energy and time, and highlight major challenges in the manufacturing, integration and understanding of the physical mechanisms and material properties.

  13. Scalable Cloning on Large-Scale GPU Platforms with Application to Time-Stepped Simulations on Grids

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yoginath, Srikanth B.; Perumalla, Kalyan S.

    Cloning is a technique to efficiently simulate a tree of multiple what-if scenarios that are unraveled during the course of a base simulation. However, cloned execution is highly challenging to realize on large, distributed memory computing platforms, due to the dynamic nature of the computational load across clones, and due to the complex dependencies spanning the clone tree. In this paper, we present the conceptual simulation framework, algorithmic foundations, and runtime interface of CloneX, a new system we designed for scalable simulation cloning. It efficiently and dynamically creates whole logical copies of a dynamic tree of simulations across a large parallel system without full physical duplication of computation and memory. The performance of a prototype implementation executed on up to 1,024 graphical processing units of a supercomputing system has been evaluated with three benchmarks (heat diffusion, forest fire, and disease propagation models), delivering a speedup of over two orders of magnitude compared to replicated runs. Finally, the results demonstrate a significantly faster and scalable way to execute many what-if scenario ensembles of large simulations via cloning using the CloneX interface.

  14. Direct Laser Writing-Based Programmable Transfer Printing via Bioinspired Shape Memory Reversible Adhesive.

    PubMed

    Huang, Yin; Zheng, Ning; Cheng, Zhiqiang; Chen, Ying; Lu, Bingwei; Xie, Tao; Feng, Xue

    2016-12-28

    Flexible and stretchable electronics offer a wide range of unprecedented opportunities beyond conventional rigid electronics. Despite their vast promise, a significant bottleneck lies in the availability of a transfer printing technique to manufacture such devices in a highly controllable and scalable manner. Current technologies usually rely on manual stick-and-place and do not offer feasible mechanisms for precise and quantitative process control, especially when scalability is taken into account. Here, we demonstrate a spatioselective and programmable transfer strategy to print electronic microelements onto a soft substrate. The method takes advantage of automated direct laser writing to trigger localized heating of a micropatterned shape memory polymer adhesive stamp, allowing highly controlled and spatioselective switching of the interfacial adhesion. This, coupled with the proper tuning of the stamp properties, enables printing with perfect yield. The wide-range adhesion switchability further allows printing of hybrid electronic elements, which is otherwise challenging given the complex interfacial manipulation involved. Our temperature-controlled transfer printing technique shows its critical importance and obvious advantages in the potential scale-up of device manufacturing. Our strategy opens a route to manufacturing flexible electronics with exceptional versatility and potential scalability.

  15. Scalable Cloning on Large-Scale GPU Platforms with Application to Time-Stepped Simulations on Grids

    DOE PAGES

    Yoginath, Srikanth B.; Perumalla, Kalyan S.

    2018-01-31

    Cloning is a technique to efficiently simulate a tree of multiple what-if scenarios that are unraveled during the course of a base simulation. However, cloned execution is highly challenging to realize on large, distributed memory computing platforms, due to the dynamic nature of the computational load across clones, and due to the complex dependencies spanning the clone tree. In this paper, we present the conceptual simulation framework, algorithmic foundations, and runtime interface of CloneX, a new system we designed for scalable simulation cloning. It efficiently and dynamically creates whole logical copies of a dynamic tree of simulations across a large parallel system without full physical duplication of computation and memory. The performance of a prototype implementation executed on up to 1,024 graphical processing units of a supercomputing system has been evaluated with three benchmarks (heat diffusion, forest fire, and disease propagation models), delivering a speedup of over two orders of magnitude compared to replicated runs. Finally, the results demonstrate a significantly faster and scalable way to execute many what-if scenario ensembles of large simulations via cloning using the CloneX interface.

  16. A distributed-memory approximation algorithm for maximum weight perfect bipartite matching

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azad, Ariful; Buluc, Aydin; Li, Xiaoye S.

    We design and implement an efficient parallel approximation algorithm for the problem of maximum weight perfect matching in bipartite graphs, i.e. the problem of finding a set of non-adjacent edges that covers all vertices and has maximum weight. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization, where sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64, are widely used due to the lack of scalable alternatives. To overcome this limitation, we propose a fully parallel distributed memory algorithm that first generates a perfect matching and then searches for weight-augmenting cycles of length four in parallel and iteratively augments the matching with a vertex-disjoint set of such cycles. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.

  17. An Ephemeral Burst-Buffer File System for Scientific Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wang, Teng; Moody, Adam; Yu, Weikuan

    BurstFS is a distributed file system for node-local burst buffers on high-performance computing systems. BurstFS presents a shared file system space across the burst buffers so that applications that use shared files can access the highly scalable burst buffers without modification.

  18. Testing New Programming Paradigms with NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

    2000-01-01

    Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also in the increasing complexity of real applications. Technologies have been developed that aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. In recent years, new efforts have been made to define new parallel programming paradigms. The best examples are HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplifying the tedious tasks encountered in writing message passing programs. HPF is independent of the memory hierarchy; however, due to the immaturity of compiler technology, its performance is still questionable. Although the use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promises portability in a heterogeneous environment and offers the possibility to "compile once and run anywhere." To test these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage was applied to several benchmarks, notably BT and SP, resulting in better sequential performance. To overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as the lack of support for the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outermost loop level to achieve the largest granularity, as illustrated in the sketch below. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size. These results were obtained on an SGI Origin2000 (195MHz) with the MIPSpro-f77 compiler 7.2.1 for the OpenMP and MPI codes and the PGI pghpf-2.4.3 compiler with MPI interface for the HPF programs.
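
    The outermost-loop strategy can be illustrated with a generic Jacobi-style sweep in C (not actual NPB source): placing the directive on the outermost loop gives each thread a large contiguous slab of the grid, which is the granularity argument made above.

        /* Outermost-loop OpenMP parallelization of a Jacobi sweep.
           Grid sizes are illustrative. */
        #include <stdio.h>

        #define NX 64
        #define NY 64
        #define NZ 64

        static double u[NX][NY][NZ], unew[NX][NY][NZ];

        void jacobi_sweep(void) {
            #pragma omp parallel for   /* outermost loop: largest granularity */
            for (int i = 1; i < NX - 1; ++i)
                for (int j = 1; j < NY - 1; ++j)
                    for (int k = 1; k < NZ - 1; ++k)
                        unew[i][j][k] = (u[i-1][j][k] + u[i+1][j][k]
                                       + u[i][j-1][k] + u[i][j+1][k]
                                       + u[i][j][k-1] + u[i][j][k+1]) / 6.0;
        }

        int main(void) {
            for (int i = 0; i < NX; ++i)
                for (int j = 0; j < NY; ++j)
                    for (int k = 0; k < NZ; ++k)
                        u[i][j][k] = (double)(i + j + k);
            jacobi_sweep();
            printf("unew[1][1][1] = %g\n", unew[1][1][1]);
            return 0;
        }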

  19. The Scalable Checkpoint/Restart Library

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moody, A.

    The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.
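
    A minimal checkpoint loop against the SCR interface looks roughly like the sketch below. The calls shown (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint, SCR_Finalize) follow the published SCR API, but this is an assumed usage pattern; verify signatures against the SCR documentation.

        /* Sketch of an application-level checkpoint loop with SCR
           (assumed usage; check against the SCR documentation). */
        #include <mpi.h>
        #include <stdio.h>
        #include "scr.h"

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            SCR_Init();

            for (int step = 0; step < 100; ++step) {
                /* ... computation for this step ... */
                int need = 0;
                SCR_Need_checkpoint(&need);    /* let SCR pick the frequency */
                if (need) {
                    SCR_Start_checkpoint();
                    char file[SCR_MAX_FILENAME];
                    /* SCR routes the path to node-local storage (RAM disk). */
                    SCR_Route_file("ckpt.dat", file);
                    FILE *fp = fopen(file, "w");
                    if (fp) { fprintf(fp, "step %d\n", step); fclose(fp); }
                    SCR_Complete_checkpoint(fp != NULL);
                }
            }

            SCR_Finalize();
            MPI_Finalize();
            return 0;
        }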

  20. Accelerate quasi Monte Carlo method for solving systems of linear algebraic equations through shared memory

    NASA Astrophysics Data System (ADS)

    Lai, Siyan; Xu, Ying; Shao, Bo; Guo, Menghan; Lin, Xiaola

    2017-04-01

    In this paper we study a Monte Carlo method for solving systems of linear algebraic equations (SLAE) based on shared memory. Earlier research demonstrated that GPUs can effectively speed up computations for this problem. Our purpose is to optimize the Monte Carlo simulation specifically for the GPU memory architecture. Random numbers are organized to be stored in shared memory, which accelerates the parallel algorithm. Bank conflicts can be avoided by our Collaborative Thread Arrays (CTA) scheme. The results of experiments show that the shared-memory-based strategy can speed up the computations by up to 3×.

  1. CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU

    PubMed Central

    Ma, Jianliang; Meng, Jinglei; Chen, Tianzhou; Wu, Minghui

    2015-01-01

    Ultra-high thread-level parallelism in modern GPUs usually generates numerous memory requests simultaneously, so plenty of memory requests are always waiting at each bank of the shared LLC (L2 in this paper) and of global memory. For global memory, various schedulers have already been developed to adjust the request sequence, but we find that little work has focused on the service sequence at the shared LLC. We measured that in a large number of GPU applications, requests always queue at the LLC banks for service, which provides an opportunity to optimize the service order at the LLC. By adjusting the service order of GPU memory requests, we can improve the schedulability of the SMs. We therefore propose a critical-aware shared LLC request scheduling algorithm (CaLRS) in this paper. How the priority of a memory request is represented is critical for CaLRS. We use the number of memory requests that originate from the same warp but have not yet been serviced when they arrive at the shared LLC bank to represent the criticality of each warp. Experiments show that the proposed scheme can effectively boost SM schedulability by promoting the scheduling priority of memory requests with high criticality, and thereby indirectly improves GPU performance. PMID:25729772

  2. A Multi-Level Parallelization Concept for High-Fidelity Multi-Block Solvers

    NASA Technical Reports Server (NTRS)

    Hatay, Ferhat F.; Jespersen, Dennis C.; Guruswamy, Guru P.; Rizk, Yehia M.; Byun, Chansup; Gee, Ken; VanDalsem, William R. (Technical Monitor)

    1997-01-01

    The integration of high-fidelity Computational Fluid Dynamics (CFD) analysis tools with the industrial design process benefits greatly from robust implementations that are transportable across a wide range of computer architectures. In the present work, a hybrid domain-decomposition and parallelization concept was developed and implemented into the widely used NASA multi-block Computational Fluid Dynamics (CFD) packages implemented in ENSAERO and OVERFLOW. The new parallel solver concept, PENS (Parallel Euler Navier-Stokes Solver), employs both fine and coarse granularity in data partitioning as well as data coalescing to obtain the desired load-balance characteristics on the available computer platforms. This multi-level parallelism implementation itself introduces no changes to the numerical results, hence the original fidelity of the packages is identically preserved. The present implementation uses the Message Passing Interface (MPI) library for interprocessor message passing and memory accessing. By choosing an appropriate combination of the available partitioning and coalescing capabilities only during the execution stage, the PENS solver becomes adaptable to different computer architectures, from shared-memory to distributed-memory platforms with varying degrees of parallelism. The PENS implementation on the IBM SP2 distributed memory environment at the NASA Ames Research Center obtains 85 percent scalable parallel performance using fine-grain partitioning of single-block CFD domains on up to 128 wide computational nodes. Multi-block CFD simulations of complete aircraft achieve 75 percent of perfect load balance using data coalescing and the two levels of parallelism. SGI PowerChallenge, SGI Origin 2000, and a cluster of workstations are the other platforms where the robustness of the implementation is tested. The performance behavior on the other computer platforms with a variety of realistic problems will be included as this on-going study progresses.

  3. Time Constraints and Resource Sharing in Adults' Working Memory Spans

    ERIC Educational Resources Information Center

    Barrouillet, Pierre; Bernardin, Sophie; Camos, Valerie

    2004-01-01

    This article presents a new model that accounts for working memory spans in adults, the time-based resource-sharing model. The model assumes that both components (i.e., processing and maintenance) of the main working memory tasks require attention and that memory traces decay as soon as attention is switched away. Because memory retrievals are…

  4. Quantum memories and Landauer's principle

    NASA Astrophysics Data System (ADS)

    Alicki, Robert

    2011-10-01

    Two types of arguments concerning (im)possibility of constructing a scalable, exponentially stable quantum memory equipped with Hamiltonian controls are discussed. The first type concerns ergodic properties of open Kitaev models which are considered as promising candidates for such memories. It is shown that, although the 4D Kitaev model provides stable qubit observables, the Hamiltonian control is not possible. The thermodynamical approach leads to the new proposal of the revised version of Landauer's principle and suggests that the existence of quantum memory implies the existence of the perpetuum mobile of the second kind. Finally, a discussion of the stability property of information and its implications is presented.

  5. Holographic storage of biphoton entanglement.

    PubMed

    Dai, Han-Ning; Zhang, Han; Yang, Sheng-Jun; Zhao, Tian-Ming; Rui, Jun; Deng, You-Jin; Li, Li; Liu, Nai-Le; Chen, Shuai; Bao, Xiao-Hui; Jin, Xian-Min; Zhao, Bo; Pan, Jian-Wei

    2012-05-25

    Coherent and reversible storage of multiphoton entanglement with a multimode quantum memory is essential for scalable all-optical quantum information processing. Although a single photon has been successfully stored in different quantum systems, storage of multiphoton entanglement remains challenging because of the critical requirement for coherent control of the photonic entanglement source, multimode quantum memory, and quantum interface between them. Here we demonstrate a coherent and reversible storage of biphoton Bell-type entanglement with a holographic multimode atomic-ensemble-based quantum memory. The retrieved biphoton entanglement violates the Bell inequality for 1 μs storage time and a memory-process fidelity of 98% is demonstrated by quantum state tomography.

  6. Externalising the autobiographical self: sharing personal memories online facilitated memory retention.

    PubMed

    Wang, Qi; Lee, Dasom; Hou, Yubo

    2017-07-01

    Internet technology provides a new means of recalling and sharing personal memories in the digital age. What is the mnemonic consequence of posting personal memories online? Theories of transactive memory and autobiographical memory would make contrasting predictions. In the present study, college students completed a daily diary for a week, listing at the end of each day all the events that happened to them on that day. They also reported whether they posted any of the events online. Participants received a surprise memory test after the completion of the diary recording and then another test a week later. At both tests, events posted online were significantly more likely than those not posted online to be recalled. It appears that sharing memories online may provide unique opportunities for rehearsal and meaning-making that facilitate memory retention.

  7. Dynamic Load Balancing for Adaptive Computations on Distributed-Memory Machines

    NASA Technical Reports Server (NTRS)

    1999-01-01

    Dynamic load balancing is central to adaptive mesh-based computations on large-scale parallel computers. The principal investigator has investigated various issues of the dynamic load balancing problem under NASA JOVE and JAG grants. The major accomplishments of the project are two graph partitioning algorithms and a load balancing framework. The S-HARP dynamic graph partitioner is the fastest of the known dynamic graph partitioners to date. It can partition a graph of over 100,000 vertices in 0.25 seconds on a 64-processor Cray T3E distributed-memory multiprocessor while maintaining scalability of over 16-fold speedup. Other known and widely used dynamic graph partitioners take over a second or two while giving low scalability of a few-fold speedup on 64 processors. These results have been published in journals and peer-reviewed flagship conferences.

  8. EOS developments

    NASA Astrophysics Data System (ADS)

    Sindrilaru, Elvin A.; Peters, Andreas J.; Adde, Geoffray M.; Duellmann, Dirk

    2017-10-01

    CERN has been developing and operating EOS as a disk storage solution successfully for over 6 years. The CERN deployment provides 135 PB and stores 1.2 billion replicas distributed over two computer centres. The deployment includes four LHC instances, a shared instance for smaller experiments and, since last year, an instance for individual user data as well. The user instance represents the backbone of the CERNBOX file-sharing service. New use cases such as synchronisation and sharing, the planned migration to reduce AFS usage at CERN, and continuous growth have brought EOS to new challenges. Recent developments include the integration and evaluation of various technologies for the transition from a single active in-memory namespace to a scale-out implementation distributed over many meta-data servers. The new architecture aims to separate the data from the application logic and user interface code, thus providing flexibility and scalability to the namespace component. Another important goal is to provide EOS as a CERN-wide mounted filesystem with strong authentication, making it a single storage repository accessible via various services and front-ends (/eos initiative). This required new developments in the security infrastructure of the EOS FUSE implementation. Furthermore, there was a series of improvements targeting the end-user experience, such as tighter consistency and latency optimisations. In collaboration with Seagate as an Openlab partner, EOS has complete integration of an OpenKinetic object-drive cluster as a high-throughput, high-availability, low-cost storage solution. This contribution will discuss these three main development projects and present new performance metrics.

  9. IGA-ADS: Isogeometric analysis FEM using ADS solver

    NASA Astrophysics Data System (ADS)

    Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav

    2017-08-01

    In this paper we present a fast explicit solver for the solution of non-stationary problems using L2 projections with the isogeometric finite element method. The solver has been implemented within the GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, the implementation of three exemplary PDEs, and the execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, and non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near-perfect scalability on a Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).

  10. Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment*†

    PubMed Central

    Khan, Md. Ashfaquzzaman; Herbordt, Martin C.

    2011-01-01

    Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations. PMID:21822327

  11. Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment.

    PubMed

    Khan, Md Ashfaquzzaman; Herbordt, Martin C

    2011-07-20

    Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations.

  12. From photons to phonons and back: a THz optical memory in diamond.

    PubMed

    England, D G; Bustard, P J; Nunn, J; Lausten, R; Sussman, B J

    2013-12-13

    Optical quantum memories are vital for the scalability of future quantum technologies, enabling long-distance secure communication and local synchronization of quantum components. We demonstrate a THz-bandwidth memory for light using the optical phonon modes of a room-temperature diamond. This large bandwidth makes the memory compatible with down-conversion-type photon sources. We demonstrate that four-wave mixing noise in this system is suppressed by material dispersion. The resulting noise floor is just 7×10⁻³ photons per pulse, which establishes that the memory is capable of storing single quanta. We investigate the principal sources of noise in this system and demonstrate that high material dispersion can be used to suppress four-wave mixing noise in Λ-type systems.

  13. Cheetah: A Framework for Scalable Hierarchical Collective Operations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Graham, Richard L; Gorentla Venkata, Manjunath; Ladd, Joshua S

    2011-01-01

    Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous with increasing node and core-per-node counts. Also, a growing number of data-access mechanisms, of varying characteristics, are supported within a single computer system. We describe a new hierarchical collective communication framework that takes advantage of hardware-specific data-access mechanisms. It is flexible, with run-time hierarchy specification, and sharing of collective communication primitives between collective algorithms. Data buffers are shared between levels in the hierarchy, reducing collective communication management overhead. We have implemented several versions of the Message Passing Interface (MPI) collective operations, MPI_Barrier() and MPI_Bcast(), and run experiments using up to 49,152 processes on a Cray XT5 and a small InfiniBand-based cluster. At 49,152 processes our barrier implementation outperforms the optimized native implementation by 75%. 32-byte and one-megabyte broadcasts outperform it by 62% and 11%, respectively, with better scalability characteristics. Improvements relative to the default Open MPI implementation are much larger.

  14. Input-independent, Scalable and Fast String Matching on the Cray XMT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Villa, Oreste; Chavarría-Miranda, Daniel; Maschhoff, Kristyn J

    2009-05-25

    String searching is at the core of many security and network applications like search engines, intrusion detection systems, virus scanners and spam filters. The growing size of on-line content and the increasing wire speeds push the need for fast, and often real-time, string searching solutions. For these conditions, many software implementations (if not all) targeting conventional cache-based microprocessors do not perform well. They either exhibit overall low performance or exhibit highly variable performance depending on the types of inputs. For this reason, real-time state-of-the-art solutions rely on the use of either custom hardware or Field-Programmable Gate Arrays (FPGAs) at the expense of overall system flexibility and programmability. This paper presents a software-based implementation of the Aho-Corasick string searching algorithm on the Cray XMT multithreaded shared memory machine. Our solution relies on the particular features of the XMT architecture and on several algorithmic strategies: it is fast, scalable and its performance is virtually content-independent. On a 128-processor Cray XMT, it reaches a scanning speed of ≈ 28 Gbps with a performance variability below 10%. In the 10 Gbps performance range, variability is below 2.5%. By comparison, an Intel dual-socket, 8-core system running at 2.66 GHz achieves a peak performance which varies from 500 Mbps to 10 Gbps depending on the type of input and dictionary size.
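
    For reference, the core of the Aho-Corasick algorithm is compact; the serial C sketch below builds a trie over the patterns, converts it into a full automaton with failure links via a BFS, and then scans the input with one state transition per byte. The XMT implementation described above restructures this for massive multithreading; the sketch shows only the underlying algorithm, with an assumed node-pool size.

        /* Compact serial Aho-Corasick: build trie, add failure links,
           scan text counting pattern occurrences. Pool size assumed. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define ALPHA 256

        typedef struct { int next[ALPHA]; int fail; int hits; } Node;

        static Node *trie;
        static int nnodes;

        static int new_node(void) { return nnodes++; }   /* pool is zeroed */

        static void ac_add(const char *pat) {
            int s = 0;
            for (; *pat; ++pat) {
                unsigned char c = (unsigned char)*pat;
                if (!trie[s].next[c]) trie[s].next[c] = new_node();
                s = trie[s].next[c];
            }
            trie[s].hits++;
        }

        /* BFS over the trie: set failure links and fill in missing
           transitions so scanning needs exactly one lookup per byte. */
        static void ac_build(void) {
            int *q = malloc(nnodes * sizeof(int)), head = 0, tail = 0;
            for (int c = 0; c < ALPHA; ++c)
                if (trie[0].next[c]) q[tail++] = trie[0].next[c];
            while (head < tail) {
                int s = q[head++];
                trie[s].hits += trie[trie[s].fail].hits;  /* inherit matches */
                for (int c = 0; c < ALPHA; ++c) {
                    int t = trie[s].next[c];
                    if (t) { trie[t].fail = trie[trie[s].fail].next[c]; q[tail++] = t; }
                    else trie[s].next[c] = trie[trie[s].fail].next[c];
                }
            }
            free(q);
        }

        static long ac_scan(const unsigned char *buf, size_t n) {
            long matches = 0;
            int s = 0;
            for (size_t i = 0; i < n; ++i) {
                s = trie[s].next[buf[i]];
                matches += trie[s].hits;
            }
            return matches;
        }

        int main(void) {
            trie = calloc(1024, sizeof(Node));   /* node 0 is the root */
            nnodes = 1;
            ac_add("he"); ac_add("she"); ac_add("his"); ac_add("hers");
            ac_build();
            const char *text = "ushers";
            printf("%ld matches\n", ac_scan((const unsigned char *)text,
                                            strlen(text)));   /* prints 3 */
            return 0;
        }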

  15. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azad, Ariful; Buluc, Aydn; Pothen, Alex

    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single-source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.

  16. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    DOE PAGES

    Azad, Ariful; Buluc, Aydn; Pothen, Alex

    2016-03-24

    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single-source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.

  17. GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5

    NASA Astrophysics Data System (ADS)

    Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.

    2018-07-01

    This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192³ grid points for the scalar field computed with 8192 XK7 nodes.
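
    The NOWAIT-on-TARGET idea can be shown with a toy C example (assumed arrays and sizes, not the paper's CCD kernels): the target region becomes a deferred task that runs on the device while the host threads work on independent data, and a taskwait joins the two streams of work.

        /* Overlapping device and host work with OpenMP 4.5 TARGET + NOWAIT.
           Toy computation; array names and sizes are assumptions. */
        #include <stdio.h>

        #define N 1000000

        static double a[N], b[N];

        int main(void) {
            for (int i = 0; i < N; ++i) a[i] = (double)i;

            /* Deferred target task: offloaded update runs asynchronously. */
            #pragma omp target teams distribute parallel for \
                    map(tofrom: a[0:N]) nowait
            for (int i = 0; i < N; ++i) a[i] = 2.0 * a[i];

            /* Meanwhile the CPU threads fill an independent array. */
            #pragma omp parallel for
            for (int i = 0; i < N; ++i) b[i] = 3.0 * i;

            /* Join the deferred target task before using its results. */
            #pragma omp taskwait
            printf("a[10] = %g, b[10] = %g\n", a[10], b[10]);
            return 0;
        }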

  18. Distributed-Memory Breadth-First Search on Massive Graphs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Buluc, Aydin; Beamer, Scott; Madduri, Kamesh

    This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition.
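
    The serial C sketch below shows the level-synchronous top-down loop at the heart of these algorithms: the current frontier is expanded one level at a time and unvisited neighbors form the next frontier. In the distributed-memory setting the vertices (1D) or the adjacency matrix (2D) are partitioned, and the frontier is exchanged between processes at every level.

        /* Level-synchronous top-down BFS over a CSR graph (serial sketch). */
        #include <stdio.h>
        #include <stdlib.h>

        void bfs(int n, const int *xadj, const int *adj, int src, int *level) {
            int *frontier = malloc(n * sizeof(int));
            int *next     = malloc(n * sizeof(int));
            for (int v = 0; v < n; ++v) level[v] = -1;

            int nf = 0, depth = 0;
            frontier[nf++] = src;
            level[src] = 0;

            while (nf > 0) {                     /* one synchronized level */
                int nn = 0;
                for (int i = 0; i < nf; ++i) {
                    int u = frontier[i];
                    for (int e = xadj[u]; e < xadj[u + 1]; ++e) {
                        int v = adj[e];
                        if (level[v] < 0) {      /* first visit */
                            level[v] = depth + 1;
                            next[nn++] = v;
                        }
                    }
                }
                int *tmp = frontier; frontier = next; next = tmp;
                nf = nn;
                ++depth;
            }
            free(frontier);
            free(next);
        }

        int main(void) {
            /* Path graph 0-1-2-3 in CSR form. */
            int xadj[]  = {0, 1, 3, 5, 6};
            int adj[]   = {1, 0, 2, 1, 3, 2};
            int level[4];
            bfs(4, xadj, adj, 0, level);
            for (int v = 0; v < 4; ++v) printf("level[%d] = %d\n", v, level[v]);
            return 0;
        }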

  19. A look at scalable dense linear algebra libraries

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dongarra, J.J.; Van de Geijn, R.A.; Walker, D.W.

    1992-01-01

    We discuss the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers. The square block scattered decomposition is proposed as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems. An object-oriented interface to the library permits more portable applications to be written, and is easy to learn and use, since details of the parallel implementation are hidden from the user. Experiments on the Intel Touchstone Delta system with a prototype code that uses the square block scattered decomposition to perform LU factorization are presented and analyzed. It was found that the code was both scalable and efficient, performing at about 14 GFLOPS (double precision) for the largest problem considered.
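
    The square block scattered decomposition maps global block (bi, bj) to process (bi mod P, bj mod Q) on a P x Q process grid, so consecutive blocks are scattered across all processes. A minimal owner computation in C follows (illustrative sketch; real libraries add grid offsets and rectangular blocks).

        /* Owner computation for a square block-scattered (block-cyclic)
           decomposition. Illustrative sketch, not library code. */
        #include <stdio.h>

        typedef struct { int prow, pcol; } Owner;

        Owner owner_of_block(int bi, int bj, int P, int Q) {
            Owner o = { bi % P, bj % Q };
            return o;
        }

        /* Which process owns global element (i, j), for block size nb? */
        Owner owner_of_element(int i, int j, int nb, int P, int Q) {
            return owner_of_block(i / nb, j / nb, P, Q);
        }

        int main(void) {
            Owner o = owner_of_element(1000, 250, 64, 4, 4);
            printf("element (1000,250) lives on process (%d,%d)\n",
                   o.prow, o.pcol);
            return 0;
        }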

  20. A look at scalable dense linear algebra libraries

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dongarra, J.J.; Van de Geijn, R.A.; Walker, D.W.

    1992-08-01

    We discuss the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers. The square block scattered decomposition is proposed as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems. An object-oriented interface to the library permits more portable applications to be written, and is easy to learn and use, since details of the parallel implementation are hidden from the user. Experiments on the Intel Touchstone Delta system with a prototype code that uses the square block scattered decomposition to perform LU factorization are presented and analyzed. It was found that the code was both scalable and efficient, performing at about 14 GFLOPS (double precision) for the largest problem considered.

  1. A Systems Approach to Scalable Transportation Network Modeling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Perumalla, Kalyan S

    2006-01-01

    Emerging needs in transportation network modeling and simulation are raising new challenges with respect to scalability of network size and vehicular traffic intensity, speed of simulation for simulation-based optimization, and fidelity of vehicular behavior for accurate capture of event phenomena. Parallel execution is warranted to sustain the required detail, size and speed. However, few parallel simulators exist for such applications, partly due to the challenges underlying their development. Moreover, many simulators are based on time-stepped models, which can be computationally inefficient for the purposes of modeling evacuation traffic. Here an approach is presented to designing a simulator with memory and speed efficiency as the goals from the outset, and, specifically, scalability via parallel execution. The design makes use of discrete event modeling techniques as well as parallel simulation methods. Our simulator, called SCATTER, is being developed, incorporating such design considerations. Preliminary performance results are presented on benchmark road networks, showing scalability to one million vehicles simulated on one processor.

  2. Motivation for Knowledge Sharing by Expert Participants in Company-Hosted Online User Communities

    ERIC Educational Resources Information Center

    Cheng, Jingli

    2014-01-01

    Company-hosted online user communities are increasingly popular as firms continue to search for ways to provide their customers with high quality and reliable support in a low cost and scalable way. Yet, empirical understanding of motivations for knowledge sharing in this type of online communities is lacking, especially with regard to an…

  3. Shared versus distributed memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.

    1991-01-01

    The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors.

  4. Visual and Spatial Working Memory Are Not that Dissociated after All: A Time-Based Resource-Sharing Account

    ERIC Educational Resources Information Center

    Vergauwe, Evie; Barrouillet, Pierre; Camos, Valerie

    2009-01-01

    Examinations of interference between visual and spatial materials in working memory have suggested domain- and process-based fractionations of visuo-spatial working memory. The present study examined the role of central time-based resource sharing in visuo-spatial working memory and assessed its role in obtained interference patterns. Visual and…

  5. A Cross-Platform Infrastructure for Scalable Runtime Application Performance Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jack Dongarra; Shirley Moore; Bart Miller; Jeffrey Hollingsworth

    2005-03-15

    The purpose of this project was to build an extensible cross-platform infrastructure to facilitate the development of accurate and portable performance analysis tools for current and future high performance computing (HPC) architectures. Major accomplishments include tools and techniques for multidimensional performance analysis, as well as improved support for dynamic performance monitoring of multithreaded and multiprocess applications. Previous performance tool development has been limited by the burden of having to re-write a platform-dependent low-level substrate for each architecture/operating system pair in order to obtain the necessary performance data from the system. Manual interpretation of performance data is not scalable for large-scale, long-running applications. The infrastructure developed by this project provides a foundation for building portable and scalable performance analysis tools, with the end goal being to provide application developers with the information they need to analyze, understand, and tune the performance of terascale applications on HPC architectures. The backend portion of the infrastructure provides runtime instrumentation capability and access to hardware performance counters, with thread-safety for shared memory environments and a communication substrate to support instrumentation of multiprocess and distributed programs. Front end interfaces provide tool developers with a well-defined, platform-independent set of calls for requesting performance data. End-user tools have been developed that demonstrate runtime data collection, on-line and off-line analysis of performance data, and multidimensional performance analysis. The infrastructure is based on two underlying performance instrumentation technologies. These technologies are the PAPI cross-platform library interface to hardware performance counters and the cross-platform Dyninst library interface for runtime modification of executable images. The Paradyn and KOJAK projects have made use of this infrastructure to build performance measurement and analysis tools that scale to long-running programs on large parallel and distributed systems and that automate much of the search for performance bottlenecks.

  6. Studying an Eulerian Computer Model on Different High-performance Computer Platforms and Some Applications

    NASA Astrophysics Data System (ADS)

    Georgiev, K.; Zlatev, Z.

    2010-11-01

    The Danish Eulerian Model (DEM) is an Eulerian model for studying the transport of air pollutants on a large scale. Originally, the model was developed at the National Environmental Research Institute of Denmark. The model's computational domain covers Europe and neighbouring parts of the Atlantic Ocean, Asia and Africa. If the DEM model is to be applied on fine grids, its discretization leads to a huge computational problem, which implies that such a model must be run only on high-performance computer architectures. The implementation and tuning of such a complex large-scale model on each different computer is a non-trivial task. Here, comparison results from running this model on different kinds of vector computers (CRAY C92A, Fujitsu, etc.), parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, Macintosh G4 clusters, etc.), parallel computers with shared memory (SGI Origin, SUN, etc.) and parallel computers with two levels of parallelism (IBM SMP, IBM BlueGene/P, clusters of multiprocessor nodes, etc.) are presented. The main idea in the parallel version of DEM is a domain-partitioning approach. The effective use of the cache and hierarchical memories of modern computers is discussed, as well as the performance, speed-ups and efficiency achieved. The parallel code of DEM, created by using the MPI standard library, appears to be highly portable and shows good efficiency and scalability on different kinds of vector and parallel computers. Some important applications of the computer model output are presented briefly.

  7. Rapid recovery from transient faults in the fault-tolerant processor with fault-tolerant shared memory

    NASA Technical Reports Server (NTRS)

    Harper, Richard E.; Butler, Bryan P.

    1990-01-01

    The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.

  8. Relations of maternal style and child self-concept to autobiographical memories in chinese, chinese immigrant, and European american 3-year-olds.

    PubMed

    Wang, Qi

    2006-01-01

    The relations of maternal reminiscing style and child self-concept to children's shared and independent autobiographical memories were examined in a sample of 189 three-year-olds and their mothers from Chinese families in China, first-generation Chinese immigrant families in the United States, and European American families. Mothers shared memories with their children and completed questionnaires; children recounted autobiographical events and described themselves with a researcher. Independent of culture, gender, child age, and language skills, maternal elaborations and evaluations were associated with children's shared memory reports, and maternal evaluations and child agentic self-focus were associated with children's independent memory reports. Maternal style and child self-concept further mediated cultural influences on children's memory. The findings provide insight into the social-cultural construction of autobiographical memory.

  9. Scalable domain decomposition solvers for stochastic PDEs in high performance computing

    DOE PAGES

    Desai, Ajit; Khalil, Mohammad; Pettit, Chris; ...

    2017-09-21

    Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolution in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.

  10. Scalable domain decomposition solvers for stochastic PDEs in high performance computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Desai, Ajit; Khalil, Mohammad; Pettit, Chris

    Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolution in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.

  11. Supporting shared data structures on distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John

    1990-01-01

    Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the NCUBE/7 and iPSC/2 hypercubes is presented.

  12. Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA based implementations of selected CFD application benchmark kernels and compare them to corresponding message passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames; the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.
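
    The two styles compared in this record can be sketched with standard MPI alone (the NASA Ames SMPlib interface is not public here, so it is not shown). A minimal two-rank program contrasting a point-to-point transfer with the equivalent MPI-2 one-sided (RMA) transfer:

```c
/* Sketch: the same 4-double transfer done with point-to-point MPI and
 * with MPI-2 one-sided (RMA) calls. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double buf[4] = {0};

    /* point-to-point version */
    if (rank == 0) {
        double src[4] = {1, 2, 3, 4};
        MPI_Send(src, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* RMA version: rank 1 reads rank 0's window directly */
    double win_buf[4] = {5, 6, 7, 8};
    MPI_Win win;
    MPI_Win_create(win_buf, sizeof win_buf, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 1)
        MPI_Get(buf, 4, MPI_DOUBLE, /*target=*/0, 0, 4, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                  /* completes the Get */
    if (rank == 1)
        printf("rank 1 read %g..%g one-sidedly\n", buf[0], buf[3]);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```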

  13. SAHAYOG: A Testbed for Load Sharing under Failure,

    DTIC Science & Technology

    1987-07-01

    messages, shared memory and semaphores. To communicate using messages, processes create message queues using system-provided primitives. The message... The size of the memory that is to be shared is decided by the process when it makes a request for memory allocation. The semaphore option of IPC can be... used to prevent two or more concurrent processes from executing their critical sections at the same time. Semaphores must be used when the processes
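
    The three System V IPC mechanisms named in this record (message queues, shared memory, semaphores) look roughly as follows in C; this is a compressed single-process sketch with error handling elided, not code from SAHAYOG itself:

```c
/* Single-process sketch of the System V IPC trio; error handling elided. */
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/shm.h>
#include <sys/sem.h>
#include <string.h>
#include <stdio.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };
struct qmsg { long mtype; char mtext[64]; };

int main(void) {
    /* message queue created with a system-provided primitive */
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    struct qmsg m = {1, "hello"};
    msgsnd(qid, &m, strlen(m.mtext) + 1, 0);
    msgrcv(qid, &m, sizeof m.mtext, 1, 0);

    /* shared memory: the requesting process chooses the size */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    char *mem = (char *)shmat(shmid, NULL, 0);
    strcpy(mem, "shared data");

    /* binary semaphore guarding a critical section */
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union semun arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);
    struct sembuf down = {0, -1, 0}, up = {0, +1, 0};
    semop(semid, &down, 1);
    /* ...critical section: only one process at a time... */
    semop(semid, &up, 1);

    printf("queue %d, shm \"%s\", semaphore ok\n", qid, mem);
    shmdt(mem);
    msgctl(qid, IPC_RMID, NULL);            /* cleanup */
    shmctl(shmid, IPC_RMID, NULL);
    semctl(semid, 0, IPC_RMID);
    return 0;
}
```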

  14. Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Buntinas, D.; Mercier, G.; Gropp, W.

    2007-09-01

    This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.
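
    Microbenchmarks of the kind used in such evaluations are typically ping-pong loops; a minimal sketch measuring one-way latency between two ranks (run both on one node so messages travel the shared-memory path):

```c
/* Ping-pong latency microbenchmark; run 2 ranks on one node so the
 * messages travel through the shared-memory channel. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char byte = 0;
    const int iters = 10000;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency ~ %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);
    MPI_Finalize();
    return 0;
}
```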

  15. An Element-Based Concurrent Partitioner for Unstructured Finite Element Meshes

    NASA Technical Reports Server (NTRS)

    Ding, Hong Q.; Ferraro, Robert D.

    1996-01-01

    A concurrent partitioner for partitioning unstructured finite element meshes on distributed memory architectures is developed. The partitioner uses an element-based partitioning strategy. Its main advantage over the more conventional node-based partitioning strategy is its modular programming approach to the development of parallel applications. The partitioner first partitions element centroids using a recursive inertial bisection algorithm. Elements and nodes then migrate according to the partitioned centroids, using a data request communication template for unpredictable incoming messages. Our scalable implementation is contrasted to a non-scalable implementation which is a straightforward parallelization of a sequential partitioner.

  16. Comparison of two paradigms for distributed shared memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Levelt, W.G.; Kaashoek, M.F.; Bal, H.E.

    1990-08-01

    The paper compares two paradigms for Distributed Shared Memory on loosely coupled computing systems: the shared data-object model as used in Orca, a programming language specially designed for loosely coupled computing systems, and the Shared Virtual Memory model. For both paradigms the authors have implemented two systems, one using only point-to-point messages, the other using broadcasting as well. They briefly describe these two paradigms and their implementations. Then they compare their performance on four applications: the traveling salesman problem, alpha-beta search, matrix multiplication and the all-pairs shortest paths problem. The measurements show that both paradigms can be used efficiently for programming large-grain parallel applications. Significant speedups were obtained on all applications. The unstructured Shared Virtual Memory paradigm achieves the best absolute performance, although this is largely due to the preliminary nature of the Orca compiler used. The structured shared data-object model achieves the highest speedups and is much easier to program and to debug.

  17. HEC Applications on Columbia Project

    NASA Technical Reports Server (NTRS)

    Taft, Jim

    2004-01-01

    NASA's Columbia system consists of a cluster of twenty 512-processor SGI Altix systems. Each of these systems is 3 TFLOP/s in peak performance - approximately the same as the entire compute capability at NAS just one year ago. Each 512p system is a single system image machine with one Linux OS, one high performance file system, and one globally shared memory. The NAS Terascale Applications Group (TAG) is chartered to assist in scaling NASA's mission critical codes to at least 512p in order to significantly improve emergency response during flight operations, as well as provide significant improvements in the codes and in the rate of scientific discovery across the scientific disciplines within NASA's missions. Recent accomplishments are 4x improvements to codes in the ocean modeling community, 10x performance improvements in a number of computational fluid dynamics codes used in aero-vehicle design, and 5x improvements in a number of space science codes dealing in extreme physics. The TAG group will continue its scaling work to 2048p and beyond (10240 CPUs) as the Columbia system becomes fully operational and the upgrades to the SGI NUMAlink memory fabric are in place. The NUMAlink upgrades dramatically improve system scalability for a single application. These upgrades will allow a number of codes to execute faster at higher fidelity than ever before on any other system, thus increasing the rate of scientific discovery even further.

  18. Parallelization of TWOPORFLOW, a Cartesian Grid based Two-phase Porous Media Code for Transient Thermo-hydraulic Simulations

    NASA Astrophysics Data System (ADS)

    Trost, Nico; Jiménez, Javier; Imke, Uwe; Sanchez, Victor

    2014-06-01

    TWOPORFLOW is a thermo-hydraulic code based on a porous media approach to simulate single- and two-phase flow including boiling. It is under development at the Institute for Neutron Physics and Reactor Technology (INR) at KIT. The code features a 3D transient solution of the mass, momentum and energy conservation equations for two inter-penetrating fluids with a semi-implicit continuous Eulerian type solver. The application domain of TWOPORFLOW includes the flow in standard porous media and in structured porous media such as micro-channels and cores of nuclear power plants. In the latter case, the fluid domain is coupled to a fuel rod model, describing the heat flow inside the solid structure. In this work, detailed profiling tools have been utilized to determine the optimization potential of TWOPORFLOW. As a result, bottlenecks were identified and reduced in the most feasible way, leading for instance to an optimization of the water-steam property computation. Furthermore, an OpenMP implementation addressing the routines in charge of inter-phase momentum, energy and mass coupling delivered good performance together with high scalability on shared memory architectures. In contrast, the approach for distributed memory systems was to solve sub-problems resulting from the decomposition of the initial Cartesian geometry. Thread communication for the sub-problem boundary updates was accomplished by the Message Passing Interface (MPI) standard.
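
    The OpenMP pattern described, parallelizing per-cell inter-phase coupling routines on a shared memory machine, could look roughly like this; variable names are hypothetical and the loop body is a stand-in for the actual TWOPORFLOW physics:

```c
/* Illustrative-only OpenMP pattern for per-cell coupling loops; the
 * array names and coupling formula are hypothetical stand-ins.
 * Compile with -fopenmp. */
#include <omp.h>
#include <stdio.h>

#define NCELLS 1000000

int main(void) {
    static double h_liquid[NCELLS], h_vapor[NCELLS], q_interface[NCELLS];
    for (int i = 0; i < NCELLS; i++) { h_liquid[i] = 1.0; h_vapor[i] = 2.0; }

    /* Cells are independent, so the inter-phase coupling terms can be
     * evaluated in parallel on a shared-memory machine. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < NCELLS; i++)
        q_interface[i] = 0.5 * (h_vapor[i] - h_liquid[i]);

    printf("q[0] = %f using up to %d threads\n", q_interface[0],
           omp_get_max_threads());
    return 0;
}
```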

  19. Working memory resources are shared across sensory modalities.

    PubMed

    Salmela, V R; Moisala, M; Alho, K

    2014-10-01

    A common assumption in the working memory literature is that the visual and auditory modalities have separate and independent memory stores. Recent evidence on visual working memory has suggested that resources are shared between representations, and that the precision of representations sets the limit for memory performance. We tested whether memory resources are also shared across sensory modalities. Memory precision for two visual (spatial frequency and orientation) and two auditory (pitch and tone duration) features was measured separately for each feature and for all possible feature combinations. Thus, only the memory load was varied, from one to four features, while keeping the stimuli similar. In Experiment 1, two gratings and two tones, both containing two varying features, were presented simultaneously. In Experiment 2, two gratings and two tones, each containing only one varying feature, were presented sequentially. The memory precision (delayed discrimination threshold) for a single feature was close to the perceptual threshold. However, as the number of features to be remembered was increased, the discrimination thresholds increased more than twofold. Importantly, the decrease in memory precision did not depend on the modality of the other feature(s), or on whether the features were in the same or in separate objects. Hence, simultaneously storing one visual and one auditory feature had an effect on memory precision equal to those of simultaneously storing two visual or two auditory features. The results show that working memory is limited by the precision of the stored representations, and that working memory can be described as a resource pool that is shared across modalities.

  20. Distributed simulation using a real-time shared memory network

    NASA Technical Reports Server (NTRS)

    Simon, Donald L.; Mattern, Duane L.; Wong, Edmond; Musgrave, Jeffrey L.

    1993-01-01

    The Advanced Control Technology Branch of the NASA Lewis Research Center performs research in the area of advanced digital controls for aeronautic and space propulsion systems. This work requires the real-time implementation of both control software and complex dynamical models of the propulsion system. We are implementing these systems in a distributed, multi-vendor computer environment. Therefore, a need exists for real-time communication and synchronization between the distributed multi-vendor computers. A shared memory network is a potential solution which offers several advantages over other real-time communication approaches. A candidate shared memory network was tested for basic performance. The shared memory network was then used to implement a distributed simulation of a ramjet engine. The accuracy and execution time of the distributed simulation were measured and compared to the performance of the non-partitioned simulation. The ease of partitioning the simulation, the minimal time required to develop communication between the processors, and the resulting execution time all indicate that the shared memory network is a real-time communication technique worthy of serious consideration.

  1. Scalable quantum computer architecture with coupled donor-quantum dot qubits

    DOEpatents

    Schenkel, Thomas; Lo, Cheuk Chi; Weis, Christoph; Lyon, Stephen; Tyryshkin, Alexei; Bokor, Jeffrey

    2014-08-26

    A quantum bit computing architecture includes a plurality of single spin memory donor atoms embedded in a semiconductor layer, a plurality of quantum dots arranged with the semiconductor layer and aligned with the donor atoms, wherein a first voltage applied across at least one pair of the aligned quantum dot and donor atom controls a donor-quantum dot coupling. A method of performing quantum computing in a scalable architecture quantum computing apparatus includes arranging a pattern of single spin memory donor atoms in a semiconductor layer, forming a plurality of quantum dots arranged with the semiconductor layer and aligned with the donor atoms, applying a first voltage across at least one aligned pair of a quantum dot and donor atom to control a donor-quantum dot coupling, and applying a second voltage between one or more quantum dots to control a Heisenberg exchange J coupling between quantum dots and to cause transport of a single spin polarized electron between quantum dots.

  2. High performance data transfer

    NASA Astrophysics Data System (ADS)

    Cottrell, R.; Fang, C.; Hanushevsky, A.; Kreuger, W.; Yang, W.

    2017-10-01

    The exponentially increasing need for high speed data transfer is driven by big data and cloud computing, together with the needs of data-intensive science, High Performance Computing (HPC), defense, the oil and gas industry, etc. We report on the Zettar ZX software. This has been developed since 2013 to meet these growing needs by providing high performance data transfer and encryption in a scalable, balanced, easy to deploy and use way while minimizing power and space utilization. In collaboration with several commercial vendors, Proofs of Concept (PoC) consisting of clusters have been put together using off-the-shelf components to test the ZX scalability and ability to balance services using multiple cores and links. The PoCs are based on SSD flash storage that is managed by a parallel file system. Each cluster occupies 4 rack units. Using the PoCs, we have achieved almost 200 Gbps memory-to-memory between clusters over two 100 Gbps links, and 70 Gbps parallel file to parallel file with encryption over a 5000 mile 100 Gbps link.

  3. Multiprocessor shared-memory information exchange

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Santoline, L.L.; Bowers, M.D.; Crew, A.W.

    1989-02-01

    In distributed microprocessor-based instrumentation and control systems, the inter- and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically, the protocol allows multiple processors to exchange information via a shared-memory interface. The authors' primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, is designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.
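
    The buffer-passing idea behind a master-slave shared-memory exchange can be sketched as follows. This is a loose illustration of a multi-buffer publish/read pattern, not the MSMIE protocol itself, whose buffer state machine and determinism guarantees are more involved:

```c
/* Hedged sketch of a unidirectional multi-buffer exchange: a slave
 * fills a spare buffer and publishes it as "newest"; the master always
 * reads the most recently published buffer. The real MSMIE protocol
 * uses per-buffer state flags to keep a reader's buffer live; this
 * sketch shows only the buffer-swapping idea. */
#include <stdatomic.h>
#include <stdio.h>

#define NBUF 3
static double buffers[NBUF][8];
static atomic_int newest = -1;      /* index of last published buffer */

/* slave side: fill a buffer other than the published one, then publish */
static void slave_publish(int iteration) {
    int idx = (atomic_load(&newest) + 1) % NBUF;
    for (int i = 0; i < 8; i++) buffers[idx][i] = iteration + i;
    atomic_store(&newest, idx);     /* publish atomically */
}

/* master side: snapshot the newest published buffer, if any */
static int master_read(double out[8]) {
    int idx = atomic_load(&newest);
    if (idx < 0) return 0;
    for (int i = 0; i < 8; i++) out[i] = buffers[idx][i];
    return 1;
}

int main(void) {
    double snap[8];
    slave_publish(1);
    if (master_read(snap)) printf("master saw %g..%g\n", snap[0], snap[7]);
    return 0;
}
```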

  4. Ferroelectric tunneling element and memory applications which utilize the tunneling element

    DOEpatents

    Kalinin, Sergei V [Knoxville, TN; Christen, Hans M [Knoxville, TN; Baddorf, Arthur P [Knoxville, TN; Meunier, Vincent [Knoxville, TN; Lee, Ho Nyung [Oak Ridge, TN

    2010-07-20

    A tunneling element includes a thin film layer of ferroelectric material and a pair of dissimilar electrically-conductive layers disposed on opposite sides of the ferroelectric layer. Because of the dissimilarity in composition or construction between the electrically-conductive layers, the electron transport behavior of the electrically-conductive layers is polarization dependent when the tunneling element is below the Curie temperature of the layer of ferroelectric material. The element can be used as the basis of compact 1R-type non-volatile random access memory (RAM). The advantages include extremely simple architecture, ultimate scalability and fast access times generic for all ferroelectric memories.

  5. Memory access in shared virtual memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Berrendorf, R.

    1992-01-01

    Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.

  6. Memory access in shared virtual memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Berrendorf, R.

    1992-09-01

    Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.

  7. Working Memory Span Development: A Time-Based Resource-Sharing Model Account

    ERIC Educational Resources Information Center

    Barrouillet, Pierre; Gavens, Nathalie; Vergauwe, Evie; Gaillard, Vinciane; Camos, Valerie

    2009-01-01

    The time-based resource-sharing model (P. Barrouillet, S. Bernardin, & V. Camos, 2004) assumes that during complex working memory span tasks, attention is frequently and surreptitiously switched from processing to reactivate decaying memory traces before their complete loss. Three experiments involving children from 5 to 14 years of age…

  8. Direct access inter-process shared memory

    DOEpatents

    Brightwell, Ronald B; Pedretti, Kevin; Hudson, Trammell B

    2013-10-22

    A technique for directly sharing physical memory between processes executing on processor cores is described. The technique includes loading a plurality of processes into the physical memory for execution on a corresponding plurality of processor cores sharing the physical memory. An address space is mapped to each of the processes by populating a first entry in a top level virtual address table for each of the processes. The address space of each of the processes is cross-mapped into each of the processes by populating one or more subsequent entries of the top level virtual address table with the first entry in the top level virtual address table from other processes.
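
    A toy model of the cross-mapping idea, with page tables reduced to plain arrays: entry q of every process's top-level table points at process q's address space, so any process can form a reference into any other's memory. This is purely conceptual; real page-table manipulation happens in the kernel, and all names here are invented.

```c
/* Conceptual model only: "top-level tables" are plain arrays of
 * pointers, and each process's "memory" is a char array. Cross-mapping
 * means every table's entry q aliases process q's space. */
#include <stdio.h>

#define NPROC 3
static char space[NPROC][16];             /* each process's "memory" */
static char *top_level[NPROC][NPROC];     /* one table per process */

int main(void) {
    for (int p = 0; p < NPROC; p++)
        for (int q = 0; q < NPROC; q++)
            top_level[p][q] = space[q];   /* cross-map every space */

    space[2][0] = 'X';                    /* process 2 writes locally */
    /* process 0 reads process 2's memory through its own table */
    printf("process 0 sees '%c' via entry 2\n", top_level[0][2][0]);
    return 0;
}
```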

  9. Memory Network For Distributed Data Processors

    NASA Technical Reports Server (NTRS)

    Bolen, David; Jensen, Dean; Millard, ED; Robinson, Dave; Scanlon, George

    1992-01-01

    Universal Memory Network (UMN) is a modular, digital data-communication system enabling computers with differing bus architectures to share 32-bit-wide data between locations up to 3 km apart with less than one millisecond of latency. It makes it possible to design sophisticated real-time and near-real-time data-processing systems without data-transfer "bottlenecks". This enterprise network permits transmission of a volume of data equivalent to an encyclopedia each second. Facilities benefiting from the Universal Memory Network include telemetry stations, simulation facilities, power plants, and large laboratories, or any facility sharing very large volumes of data. The main hub of the UMN is a reflection center that includes smaller hubs called Shared Memory Interfaces.

  10. Grouping and binding in visual short-term memory.

    PubMed

    Quinlan, Philip T; Cohen, Dale J

    2012-09-01

    Findings of 2 experiments are reported that challenge the current understanding of visual short-term memory (VSTM). In both experiments, a single study display, containing 6 colored shapes, was presented briefly and then probed with a single colored shape. At stake is how VSTM retains a record of different objects that share common features: In the 1st experiment, 2 study items sometimes shared a common feature (either a shape or a color). The data revealed a color sharing effect, in which memory was much better for items that shared a common color than for items that did not. The 2nd experiment showed that the size of the color sharing effect depended on whether a single pair of items shared a common color or whether 2 pairs of items were so defined; memory for all items improved when 2 color groups were presented. In explaining performance, an account is advanced in which items compete for a fixed number of slots, but then memory recall for any given stored item is prone to error. A critical assumption is that items that share a common color are stored together in a slot as a chunk. The evidence provides further support for the idea that principles of perceptual organization may determine the manner in which items are stored in VSTM. PsycINFO Database Record (c) 2012 APA, all rights reserved.

  11. Nanophotonic rare-earth quantum memory with optically controlled retrieval.

    PubMed

    Zhong, Tian; Kindem, Jonathan M; Bartholomew, John G; Rochman, Jake; Craiciu, Ioana; Miyazono, Evan; Bettinelli, Marco; Cavalli, Enrico; Verma, Varun; Nam, Sae Woo; Marsili, Francesco; Shaw, Matthew D; Beyer, Andrew D; Faraon, Andrei

    2017-09-29

    Optical quantum memories are essential elements in quantum networks for long-distance distribution of quantum entanglement. Scalable development of quantum network nodes requires on-chip qubit storage functionality with control of the readout time. We demonstrate a high-fidelity nanophotonic quantum memory based on a mesoscopic neodymium ensemble coupled to a photonic crystal cavity. The nanocavity enables >95% spin polarization for efficient initialization of the atomic frequency comb memory and time bin-selective readout through an enhanced optical Stark shift of the comb frequencies. Our solid-state memory is integrable with other chip-scale photon source and detector devices for multiplexed quantum and classical information processing at the network nodes. Copyright © 2017 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.

  12. Scalable Domain Decomposed Monte Carlo Particle Transport

    NASA Astrophysics Data System (ADS)

    O'Brien, Matthew Joseph

    In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation. The main algorithms we consider are:

    • Domain decomposition of constructive solid geometry: enables extremely large calculations in which the background geometry is too large to fit in the memory of a single computational node.

    • Load balancing: keeps the workload per processor as even as possible so the calculation runs efficiently.

    • Global particle find: if particles are on the wrong processor, globally resolve their locations to the correct processor based on particle coordinate and background domain (see the sketch below).

    • Visualizing constructive solid geometry, sourcing particles, deciding that particle streaming communication is completed, and spatial redecomposition.

    These algorithms are some of the most important parallel algorithms required for domain decomposed Monte Carlo particle transport. We demonstrate that our previous algorithms were not scalable, prove that our new algorithms are scalable, and run some of the algorithms up to 2 million MPI processes on the Sequoia supercomputer.
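
    The "global particle find" step can be illustrated with a small sketch: given a known spatial decomposition, any rank can compute the owner of a stray particle from its coordinate alone. A 1D slab decomposition is assumed for brevity; the dissertation's constructive solid geometry domains are far more general.

```c
/* Sketch: compute the owning rank of a particle from its coordinate,
 * assuming a uniform 1D slab decomposition of [xmin, xmax). */
#include <mpi.h>
#include <stdio.h>

static int owner_of(double x, double xmin, double xmax, int nranks) {
    int r = (int)((x - xmin) / (xmax - xmin) * nranks);
    if (r < 0) r = 0;                        /* clamp to valid range */
    if (r >= nranks) r = nranks - 1;
    return r;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double x = 0.7;                          /* a stray particle */
    int dest = owner_of(x, 0.0, 1.0, nranks);
    if (rank == 0)
        printf("particle at x=%.2f belongs to rank %d of %d\n",
               x, dest, nranks);
    /* A real code would now pack stray particles per destination and
     * exchange them, e.g. with MPI_Alltoallv. */
    MPI_Finalize();
    return 0;
}
```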

  13. Particle Communication and Domain Neighbor Coupling: Scalable Domain Decomposed Algorithms for Monte Carlo Particle Transport

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    O'Brien, M. J.; Brantley, P. S.

    2015-01-20

    In order to run Monte Carlo particle transport calculations on new supercomputers with hundreds of thousands or millions of processors, care must be taken to implement scalable algorithms. This means that the algorithms must continue to perform well as the processor count increases. In this paper, we examine the scalability of: (1) globally resolving the particle locations on the correct processor, (2) deciding that particle streaming communication has finished, and (3) efficiently coupling neighbor domains together with different replication levels. We have run domain decomposed Monte Carlo particle transport on up to 2^21 = 2,097,152 MPI processes on the IBM BG/Q Sequoia supercomputer and observed scalable results that agree with our theoretical predictions. These calculations were carefully constructed to have the same amount of work on every processor, i.e., the calculation is already load balanced. We also examine load imbalanced calculations where each domain's replication level is proportional to its particle workload. In this case we show how to efficiently couple together adjacent domains to maintain within-workgroup load balance and minimize memory usage.

  14. Scalable parallel distance field construction for large-scale applications

    DOE PAGES

    Yu, Hongfeng; Xie, Jinrong; Ma, Kwan -Liu; ...

    2015-10-01

    Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. In conclusion, our work greatly extends the usability of distance fields for demanding applications.
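
    For orientation, the quantity being computed is simply the distance from every grid point to the nearest surface point. The brute-force definition below (with an OpenMP loop as a nod to shared-memory parallelism) is what hierarchical structures like the parallel distance tree are designed to avoid at scale; all sizes and the circle "surface" are invented for the sketch.

```c
/* Brute-force distance field: distance from every grid point to the
 * nearest sample of a toy "surface" (a circle). This O(grid*surface)
 * loop is only the definition, not the paper's method.
 * Compile with -fopenmp -lm. */
#include <math.h>
#include <stdio.h>

#define GRID  64
#define NSURF 128

int main(void) {
    const double PI = 3.14159265358979323846;
    double sx[NSURF], sy[NSURF];
    for (int i = 0; i < NSURF; i++) {     /* circle of radius 0.25 */
        double t = 2.0 * PI * i / NSURF;
        sx[i] = 0.5 + 0.25 * cos(t);
        sy[i] = 0.5 + 0.25 * sin(t);
    }
    static double dist[GRID][GRID];
    #pragma omp parallel for collapse(2)
    for (int ix = 0; ix < GRID; ix++)
        for (int iy = 0; iy < GRID; iy++) {
            double x = (double)ix / (GRID - 1);
            double y = (double)iy / (GRID - 1);
            double best = 1e30;
            for (int k = 0; k < NSURF; k++) {
                double d = hypot(x - sx[k], y - sy[k]);
                if (d < best) best = d;
            }
            dist[ix][iy] = best;
        }
    printf("distance at center = %.3f (expect ~0.25)\n",
           dist[GRID / 2][GRID / 2]);
    return 0;
}
```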

  15. Scalable Parallel Distance Field Construction for Large-Scale Applications.

    PubMed

    Yu, Hongfeng; Xie, Jinrong; Ma, Kwan-Liu; Kolla, Hemanth; Chen, Jacqueline H

    2015-10-01

    Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. Our work greatly extends the usability of distance fields for demanding applications.

  16. Implementing Journaling in a Linux Shared Disk File System

    NASA Technical Reports Server (NTRS)

    Preslan, Kenneth W.; Barry, Andrew; Brassow, Jonathan; Cattelan, Russell; Manthei, Adam; Nygaard, Erling; VanOort, Seth; Teigland, David; Tilstra, Mike; O'Keefe, Matthew

    2000-01-01

    In computer systems today, speed and responsiveness is often determined by network and storage subsystem performance. Faster, more scalable networking interfaces like Fibre Channel and Gigabit Ethernet provide the scaffolding from which higher performance computer systems implementations may be constructed, but new thinking is required about how machines interact with network-enabled storage devices. In this paper we describe how we implemented journaling in the Global File System (GFS), a shared-disk, cluster file system for Linux. Our previous three papers on GFS at the Mass Storage Symposium discussed our first three GFS implementations, their performance, and the lessons learned. Our fourth paper describes, appropriately enough, the evolution of GFS version 3 to version 4, which supports journaling and recovery from client failures. In addition, GFS scalability tests extending to 8 machines accessing 8 4-disk enclosures were conducted: these tests showed good scaling. We describe the GFS cluster infrastructure, which is necessary for proper recovery from machine and disk failures in a collection of machines sharing disks using GFS. Finally, we discuss the suitability of Linux for handling the big data requirements of supercomputing centers.

  17. Shared memories reveal shared structure in neural activity across individuals

    PubMed Central

    Chen, J.; Leong, Y.C.; Honey, C.J.; Yong, C.H.; Norman, K.A.; Hasson, U.

    2016-01-01

    Our lives revolve around sharing experiences and memories with others. When different people recount the same events, how similar are their underlying neural representations? Participants viewed a fifty-minute movie, then verbally described the events during functional MRI, producing unguided detailed descriptions lasting up to forty minutes. As each person spoke, event-specific spatial patterns were reinstated in default-network, medial-temporal, and high-level visual areas. Individual event patterns were both highly discriminable from one another and similar between people, suggesting consistent spatial organization. In many high-order areas, patterns were more similar between people recalling the same event than between recall and perception, indicating systematic reshaping of percept into memory. These results reveal the existence of a common spatial organization for memories in high-level cortical areas, where encoded information is largely abstracted beyond sensory constraints; and that neural patterns during perception are altered systematically across people into shared memory representations for real-life events. PMID:27918531

  18. Scalable streaming tools for analyzing N-body simulations: Finding halos and investigating excursion sets in one pass

    NASA Astrophysics Data System (ADS)

    Ivkin, N.; Liu, Z.; Yang, L. F.; Kumar, S. S.; Lemson, G.; Neyrinck, M.; Szalay, A. S.; Braverman, V.; Budavari, T.

    2018-04-01

    Cosmological N-body simulations play a vital role in studying models for the evolution of the Universe. To compare to observations and make scientific inferences, statistical analysis on large simulation datasets (e.g., finding halos, obtaining multi-point correlation functions) is crucial. However, traditional in-memory methods for these tasks do not scale to the prohibitively large datasets of modern simulations. Our prior paper (Liu et al., 2015) proposes memory-efficient streaming algorithms that can find the largest halos in a simulation with up to 10^9 particles on a small server or desktop. However, this approach fails when directly scaling to larger datasets. This paper presents a robust streaming tool that leverages state-of-the-art techniques on GPU boosting, sampling, and parallel I/O to significantly improve performance and scalability. Our rigorous analysis of the sketch parameters improves the previous results from finding the centers of the 10^3 largest halos (Liu et al., 2015) to ~10^4 - 10^5, and reveals the trade-offs between memory, running time and number of halos. Our experiments show that our tool can scale to datasets with up to ~10^12 particles while using less than an hour of running time on a single Nvidia GTX 1080 GPU.
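
    Streaming tools of this kind are typically built on small-memory sketches; a toy count-min sketch in C shows the flavor: heavily occupied cells (halo candidates) can be estimated from a few fixed-size hash tables without storing the particle stream. Parameters and the hash function below are illustrative, not those of the paper.

```c
/* Toy count-min sketch: frequency estimates from a few small hash
 * tables. Estimates never undercount; collisions cause mild
 * overcounting. Parameters and hashing are illustrative. */
#include <stdio.h>
#include <stdint.h>

#define ROWS 4
#define COLS 1024
static uint32_t cm[ROWS][COLS];

static uint32_t hash(uint64_t key, uint32_t seed) {
    uint64_t h = key * 0x9e3779b97f4a7c15ULL ^ seed * 0x517cc1b727220a95ULL;
    h ^= h >> 32;
    return (uint32_t)h % COLS;
}

static void cm_add(uint64_t cell) {
    for (int r = 0; r < ROWS; r++) cm[r][hash(cell, r + 1)]++;
}

static uint32_t cm_estimate(uint64_t cell) {   /* min over rows */
    uint32_t best = ~0u;
    for (int r = 0; r < ROWS; r++) {
        uint32_t c = cm[r][hash(cell, r + 1)];
        if (c < best) best = c;
    }
    return best;
}

int main(void) {
    /* stream: every 10th item lands in "cell 42", the rest are unique */
    for (int i = 0; i < 100000; i++) cm_add(i % 10 == 0 ? 42 : i);
    printf("estimated count of heavy cell 42: %u (true 10000)\n",
           cm_estimate(42));
    return 0;
}
```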

  19. Parallel computing for probabilistic fatigue analysis

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.

    1993-01-01

    This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.

  20. Scalable Robust Principal Component Analysis Using Grassmann Averages.

    PubMed

    Hauberg, Søren; Feragen, Aasa; Enficiaud, Raffi; Black, Michael J

    2016-11-01

    In large datasets, manual data verification is impossible, and we must expect the number of outliers to increase with data size. While principal component analysis (PCA) can reduce data size, and scalable solutions exist, it is well-known that outliers can arbitrarily corrupt the results. Unfortunately, state-of-the-art approaches for robust PCA are not scalable. We note that in a zero-mean dataset, each observation spans a one-dimensional subspace, giving a point on the Grassmann manifold. We show that the average subspace corresponds to the leading principal component for Gaussian data. We provide a simple algorithm for computing this Grassmann Average (GA), and show that the subspace estimate is less sensitive to outliers than PCA for general distributions. Because averages can be efficiently computed, we immediately gain scalability. We exploit robust averaging to formulate the Robust Grassmann Average (RGA) as a form of robust PCA. The resulting Trimmed Grassmann Average (TGA) is appropriate for computer vision because it is robust to pixel outliers. The algorithm has linear computational complexity and minimal memory requirements. We demonstrate TGA for background modeling, video restoration, and shadow removal. We show scalability by performing robust PCA on the entire Star Wars IV movie, a task beyond any current method. Source code is available online.
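
    The "simple algorithm" for the Grassmann Average admits a compact fixed-point description; the following is a hedged reconstruction in standard notation (the authors' exact formulation may differ):

```latex
% Sign-weighted fixed-point iteration for the leading Grassmann Average
% direction (hedged reconstruction; the paper's notation may differ):
q_{t+1} = \frac{\sum_{i=1}^{n} \operatorname{sign}\bigl(x_i^{\top} q_t\bigr)\, x_i}
               {\left\lVert \sum_{i=1}^{n} \operatorname{sign}\bigl(x_i^{\top} q_t\bigr)\, x_i \right\rVert}
```

    Iterating until q stabilizes yields the leading direction; the sign weights orient each observation to agree with the current estimate, so the result lives on the Grassmannian of one-dimensional subspaces. In the trimmed variant (TGA), the sum is replaced with a per-coordinate trimmed mean to reject pixel outliers.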

  1. Mnemonic transmission, social contagion, and emergence of collective memory: Influence of emotional valence, group structure, and information distribution.

    PubMed

    Choi, Hae-Yoon; Kensinger, Elizabeth A; Rajaram, Suparna

    2017-09-01

    Social transmission of memory and its consequence on collective memory have generated enduring interdisciplinary interest because of their widespread significance in interpersonal, sociocultural, and political arenas. We tested the influence of 3 key factors (emotional salience of information, group structure, and information distribution) on mnemonic transmission, social contagion, and collective memory. Participants individually studied emotionally salient (negative or positive) and nonemotional (neutral) picture-word pairs that were completely shared, partially shared, or unshared within participant triads, and then completed 3 consecutive recalls in 1 of 3 conditions: individual-individual-individual (control), collaborative-collaborative (identical group; insular structure)-individual, and collaborative-collaborative (reconfigured group; diverse structure)-individual. Collaboration enhanced negative memories especially in insular group structure and especially for shared information, and promoted collective forgetting of positive memories. Diverse group structure reduced this negativity effect. Unequally distributed information led to social contagion that creates false memories; diverse structure propagated a greater variety of false memories whereas insular structure promoted confidence in false recognition and false collective memory. A simultaneous assessment of network structure, information distribution, and emotional valence breaks new ground to specify how network structure shapes the spread of negative memories and false memories, and the emergence of collective memory. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  2. PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences.

    PubMed

    Mirarab, Siavash; Nguyen, Nam; Guo, Sheng; Wang, Li-San; Kim, Junhyong; Warnow, Tandy

    2015-05-01

    We introduce PASTA, a new multiple sequence alignment algorithm. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé). We also show that trees estimated on PASTA alignments are highly accurate--slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is faster than SATé, highly parallelizable, and requires relatively little memory.

  3. Effects of cacheing on multitasking efficiency and programming strategy on an ELXSI 6400

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Montry, G.R.; Benner, R.E.

    1985-12-01

    The impact of a cache/shared memory architecture, and, in particular, the cache coherency problem, upon concurrent algorithm and program development is discussed. In this context, a simple set of programming strategies is proposed which streamlines code development and improves code performance when multitasking in a cache/shared memory or distributed memory environment.
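
    One concrete instance of such cache-aware strategies, still standard today, is padding per-task data to cache-line boundaries so concurrent writers do not falsely share a line; a minimal sketch (a 64-byte line size is assumed, compile with -fopenmp):

```c
/* False-sharing avoidance by padding: each task's counter occupies its
 * own cache line, so concurrent increments do not invalidate the other
 * tasks' lines. The 64-byte line size is an assumption. */
#include <stdio.h>

#define NTASKS 8
#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* one counter per line */
};

static struct padded_counter counts[NTASKS];

int main(void) {
    /* Packed longs in a plain array would ping-pong one shared cache
     * line between cores; padded counters avoid that coherency traffic. */
    #pragma omp parallel for
    for (int t = 0; t < NTASKS; t++)
        for (long i = 0; i < 1000000; i++)
            counts[t].value++;

    long total = 0;
    for (int t = 0; t < NTASKS; t++) total += counts[t].value;
    printf("total = %ld\n", total);
    return 0;
}
```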

  4. Implementation of a parallel unstructured Euler solver on shared and distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.

    1992-01-01

    An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.

  5. Progress toward scalable tomography of quantum maps using twirling-based methods and information hierarchies

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lopez, Cecilia C.; Theoretische Physik, Universitaet des Saarlandes, D-66041 Saarbruecken; Departament de Fisica, Universitat Autonoma de Barcelona, E-08193 Bellaterra

    2010-06-15

    We present in a unified manner the existing methods for scalable partial quantum process tomography. We focus on two main approaches: the one presented in Bendersky et al. [Phys. Rev. Lett. 100, 190403 (2008)] and the ones described, respectively, in Emerson et al. [Science 317, 1893 (2007)] and Lopez et al. [Phys. Rev. A 79, 042328 (2009)], which can be combined together. The methods share an essential feature: they are based on the idea that the tomography of a quantum map can be efficiently performed by studying certain properties of a twirling of such a map. From this perspective, in this paper we present extensions, improvements, and comparative analyses of the scalable methods for partial quantum process tomography. We also clarify the significance of the extracted information, and we introduce interesting and useful properties of the χ-matrix representation of quantum maps that can be used to establish a clearer path toward achieving full tomography of quantum processes in a scalable way.

  6. Emerging memories

    NASA Astrophysics Data System (ADS)

    Baldi, Livio; Bez, Roberto; Sandhu, Gurtej

    2014-12-01

    Memory is a key component of any data processing system. Following the classical Turing machine approach, memories hold both the data to be processed and the rules for processing them. In the history of microelectronics, the distinction has been rather between working memory, which is exemplified by DRAM, and storage memory, exemplified by NAND. These two types of memory devices now represent 90% of all memory market and 25% of the total semiconductor market, and have been the technology drivers in the last decades. Even if radically different in characteristics, they are however based on the same storage mechanism: charge storage, and this mechanism seems to be near to reaching its physical limits. The search for new alternative memory approaches, based on more scalable mechanisms, has therefore gained new momentum. The status of incumbent memory technologies and their scaling limitations will be discussed. Emerging memory technologies will be analyzed, starting from the ones that are already present for niche applications, and which are getting new attention, thanks to recent technology breakthroughs. Maturity level, physical limitations and potential for scaling will be compared to existing memories. At the end the possible future composition of memory systems will be discussed.

  7. Sleep Benefits Memory for Semantic Category Structure While Preserving Exemplar-Specific Information.

    PubMed

    Schapiro, Anna C; McDevitt, Elizabeth A; Chen, Lang; Norman, Kenneth A; Mednick, Sara C; Rogers, Timothy T

    2017-11-01

    Semantic memory encompasses knowledge about both the properties that typify concepts (e.g. robins, like all birds, have wings) as well as the properties that individuate conceptually related items (e.g. robins, in particular, have red breasts). We investigate the impact of sleep on new semantic learning using a property inference task in which both kinds of information are initially acquired equally well. Participants learned about three categories of novel objects possessing some properties that were shared among category exemplars and others that were unique to an exemplar, with exposure frequency varying across categories. In Experiment 1, memory for shared properties improved and memory for unique properties was preserved across a night of sleep, while memory for both feature types declined over a day awake. In Experiment 2, memory for shared properties improved across a nap, but only for the lower-frequency category, suggesting a prioritization of weakly learned information early in a sleep period. The increase was significantly correlated with amount of REM, but was also observed in participants who did not enter REM, suggesting involvement of both REM and NREM sleep. The results provide the first evidence that sleep improves memory for the shared structure of object categories, while simultaneously preserving object-unique information.

  8. Exploiting multi-scale parallelism for large scale numerical modelling of laser wakefield accelerators

    NASA Astrophysics Data System (ADS)

    Fonseca, R. A.; Vieira, J.; Fiuza, F.; Davidson, A.; Tsung, F. S.; Mori, W. B.; Silva, L. O.

    2013-12-01

    A new generation of laser wakefield accelerators (LWFA), supported by the extreme accelerating fields generated in the interaction of PW-Class lasers and underdense targets, promises the production of high quality electron beams in short distances for multiple applications. Achieving this goal will rely heavily on numerical modelling to further understand the underlying physics and identify optimal regimes, but large scale modelling of these scenarios is computationally heavy and requires the efficient use of state-of-the-art petascale supercomputing systems. We discuss the main difficulties involved in running these simulations and the new developments implemented in the OSIRIS framework to address these issues, ranging from multi-dimensional dynamic load balancing and hybrid distributed/shared memory parallelism to the vectorization of the PIC algorithm. We present the results of the OASCR Joule Metric program on the issue of large scale modelling of LWFA, demonstrating speedups of over 1 order of magnitude on the same hardware. Finally, scalability to over ~10^6 cores and sustained performance over ~2 PFlop/s is demonstrated, opening the way for large scale modelling of LWFA scenarios.
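
    The hybrid distributed/shared-memory structure mentioned above is commonly expressed as MPI across nodes with OpenMP threads inside each rank; a skeleton of that layout follows (the per-particle work is a placeholder, not an OSIRIS kernel):

```c
/* Hybrid MPI+OpenMP skeleton: threads share memory within a rank,
 * ranks communicate across nodes. Compile with an MPI compiler and
 * -fopenmp; the loop body is a stand-in for real per-particle work. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int p = 0; p < 1000000; p++)
        local_sum += 1e-6;                 /* per-particle work */

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("global = %.3f with up to %d threads/rank\n",
               global_sum, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}
```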

  9. Novel Plasmonic Materials and Nanodevices for Integrated Quantum Photonics

    NASA Astrophysics Data System (ADS)

    Shalaginov, Mikhail Y.

    Light-matter interaction is the foundation for numerous important quantum optical phenomena, which may be harnessed to build practical devices with higher efficiency and unprecedented functionality. Nanoscale engineering is seen as a fruitful avenue to significantly strengthen light-matter interaction and also make quantum optical systems ultra-compact, scalable, and energy efficient. This research focuses on color centers in diamond that share quantum properties with single atoms. These systems promise a path for the realization of practical quantum devices such as nanoscale sensors, single-photon sources, and quantum memories. In particular, we explored an intriguing methodology of utilizing nanophotonic structures, such as hyperbolic metamaterials, nanoantennae, and plasmonic waveguides, to improve the color centers performance. We observed enhancement in the color center's spontaneous emission rate, emission directionality, and cooperativity over a broad optical frequency range. Additionally, we studied the effect of plasmonic environments on the spin-readout sensitivity of color centers. The use of CMOS-compatible epitaxially grown plasmonic materials in the design of these nanophotonic structures promises a new level of performance for a variety of integrated room-temperature quantum devices based on diamond color centers.

  10. Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress has been made in hardware and software technologies, the performance of parallel programs with compiler directives has improved substantially. The introduction of OpenMP directives, the industry standard for shared-memory programming, has minimized the issue of portability. Due to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based, OpenMP, parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS parallel benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.

  11. Next Generation Mass Memory Architecture

    NASA Astrophysics Data System (ADS)

    Herpel, H.-J.; Stahle, M.; Lonsdorfer, U.; Binzer, N.

    2010-08-01

    Future mass memory units will have to cope with various demanding requirements driven by onboard instruments (optical and SAR) that generate a huge amount of data (>10 Tbit) at data rates >6 Gbps. For the downlink, data rates around 3 Gbps will be feasible using the latest Ka-band technology together with Variable Coding and Modulation (VCM) techniques. These high data rates and storage capacities need to be effectively managed. Therefore, data structures and data management functions have to be improved and adapted to existing standards like the Packet Utilisation Standard (PUS). In this paper we will present a highly modular and scalable architectural approach for mass memories in order to support a wide range of mission requirements.

  12. Semihierarchical quantum repeaters based on moderate lifetime quantum memories

    NASA Astrophysics Data System (ADS)

    Liu, Xiao; Zhou, Zong-Quan; Hua, Yi-Lin; Li, Chuan-Feng; Guo, Guang-Can

    2017-01-01

    The construction of large-scale quantum networks relies on the development of practical quantum repeaters. Many approaches have been proposed with the goal of outperforming the direct transmission of photons, but most of them are inefficient or difficult to implement with current technology. Here, we present a protocol that uses a semihierarchical structure to improve the entanglement distribution rate while reducing the requirement of memory time to a range of tens of milliseconds. This protocol can be implemented with a fixed distance of elementary links and fixed requirements on quantum memories, which are independent of the total distance. This configuration is especially suitable for scalable applications in large-scale quantum networks.

  13. TESS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dmitriy Morozov, Tom Peterka

    2014-07-29

    Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets. As the scale of simulations and observations surpasses billions of particles, a distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this software is a distributed-memory parallel Delaunay and Voronoi tessellation algorithm based on existing serial computational geometry libraries that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include the addition of periodic and wall boundary conditions.

  14. Structurally Integrated Versus Structurally Segregated Memory Representations: Implications for the Design of Instructional Materials.

    ERIC Educational Resources Information Center

    Hayes-Roth, Barbara

    Two kinds of memory organization are distinguished: segregated versus integrated. In segregated memory organizations, related learned propositions have separate memory representations. In integrated memory organizations, memory representations of related propositions share common subrepresentations. Segregated memory organizations facilitate…

  15. Getting connected: Both associative and semantic links structure semantic memory for newly learned persons.

    PubMed

    Wiese, Holger; Schweinberger, Stefan R

    2015-01-01

    The present study examined whether semantic memory for newly learned people is structured by visual co-occurrence, shared semantics, or both. Participants were trained with pairs of simultaneously presented (i.e., co-occurring) preexperimentally unfamiliar faces, which either did or did not share additionally provided semantic information (occupation, place of living, etc.). Semantic information could also be shared between faces that did not co-occur. A subsequent priming experiment revealed faster responses for both co-occurrence/no shared semantics and no co-occurrence/shared semantics conditions, than for an unrelated condition. Strikingly, priming was strongest in the co-occurrence/shared semantics condition, suggesting additive effects of these factors. Additional analysis of event-related brain potentials yielded priming in the N400 component only for combined effects of visual co-occurrence and shared semantics, with more positive amplitudes in this than in the unrelated condition. Overall, these findings suggest that both semantic relatedness and visual co-occurrence are important when novel information is integrated into person-related semantic memory.

  16. Memory T and memory B cells share a transcriptional program of self-renewal with long-term hematopoietic stem cells

    PubMed Central

    Luckey, Chance John; Bhattacharya, Deepta; Goldrath, Ananda W.; Weissman, Irving L.; Benoist, Christophe; Mathis, Diane

    2006-01-01

    The only cells of the hematopoietic system that undergo self-renewal for the lifetime of the organism are long-term hematopoietic stem cells and memory T and B cells. To determine whether there is a shared transcriptional program among these self-renewing populations, we first compared the gene-expression profiles of naïve, effector and memory CD8+ T cells with those of long-term hematopoietic stem cells, short-term hematopoietic stem cells, and lineage-committed progenitors. Transcripts augmented in memory CD8+ T cells relative to naïve and effector T cells were selectively enriched in long-term hematopoietic stem cells and were progressively lost in their short-term and lineage-committed counterparts. Furthermore, transcripts selectively decreased in memory CD8+ T cells were selectively down-regulated in long-term hematopoietic stem cells and progressively increased with differentiation. To confirm that this pattern was a general property of immunologic memory, we turned to independently generated gene expression profiles of memory, naïve, germinal center, and plasma B cells. Once again, memory-enriched and -depleted transcripts were also appropriately augmented and diminished in long-term hematopoietic stem cells, and their expression correlated with progressive loss of self-renewal function. Thus, there appears to be a common signature of both up- and down-regulated transcripts shared between memory T cells, memory B cells, and long-term hematopoietic stem cells. This signature was not consistently enriched in neural or embryonic stem cell populations and, therefore, appears to be restricted to the hematopoeitic system. These observations provide evidence that the shared phenotype of self-renewal in the hematopoietic system is linked at the molecular level. PMID:16492737

  17. Categorical and associative relations increase false memory relative to purely associative relations.

    PubMed

    Coane, Jennifer H; McBride, Dawn M; Termonen, Miia-Liisa; Cutting, J Cooper

    2016-01-01

    The goal of the present study was to examine the contributions of associative strength and similarity in terms of shared features to the production of false memories in the Deese/Roediger-McDermott list-learning paradigm. Whereas the activation/monitoring account suggests that false memories are driven by automatic associative activation from list items to nonpresented lures, combined with errors in source monitoring, other accounts (e.g., fuzzy trace theory, global-matching models) emphasize the importance of semantic-level similarity, and thus predict that shared features between list and lure items will increase false memory. Participants studied lists of nine items related to a nonpresented lure. Half of the lists consisted of items that were associated but did not share features with the lure, and the other half included items that were equally associated but also shared features with the lure (in many cases, these were taxonomically related items). The two types of lists were carefully matched in terms of a variety of lexical and semantic factors, and the same lures were used across list types. In two experiments, false recognition of the critical lures was greater following the study of lists that shared features with the critical lure, suggesting that similarity at a categorical or taxonomic level contributes to false memory above and beyond associative strength. We refer to this phenomenon as a "feature boost" that reflects additive effects of shared meaning and association strength and is generally consistent with accounts of false memory that have emphasized thematic or feature-level similarity among studied and nonstudied representations.

  18. System and method for programmable bank selection for banked memory subsystems

    DOEpatents

    Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Hoenicke, Dirk; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan

    2010-09-07

    A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices, each associated with a respective processor device, each first logic device receiving physical memory address signals and programmable to generate a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and a second logic device responsive to each respective select signal for generating an address signal used to select a memory storage structure for processor access. The system thus enables each processor device in the computing environment to access memory storage distributed across the one or more memory storage structures.
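
    To make the two-stage addressing concrete, the sketch below expresses the same logic in C under assumed parameters: four banks selected by physical address bits 12-13. The bit positions, mask, and function names are hypothetical; in the patent the first logic device is programmed at run time rather than fixed at compile time.

      #include <stdint.h>

      /* Hypothetical parameters: 4 banks selected by physical address
       * bits 12-13; the patent's first logic device would be programmed
       * with these positions and match values at run time. */
      #define BANK_SHIFT 12u
      #define BANK_MASK  0x3u

      /* "First logic device": derive a storage-structure select signal
       * from pre-determined bits of the physical memory address. */
      static inline unsigned bank_select(uint64_t phys_addr)
      {
          return (unsigned)((phys_addr >> BANK_SHIFT) & BANK_MASK);
      }

      /* "Second logic device": form the address within the selected
       * memory storage structure by dropping the two select bits. */
      static inline uint64_t bank_local_addr(uint64_t phys_addr)
      {
          uint64_t low  = phys_addr & ((1ull << BANK_SHIFT) - 1);
          uint64_t high = phys_addr >> (BANK_SHIFT + 2);
          return (high << BANK_SHIFT) | low;
      }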

  19. High Performance Programming Using Explicit Shared Memory Model on Cray T3D

    NASA Technical Reports Server (NTRS)

    Simon, Horst D.; Saini, Subhash; Grassi, Charles

    1994-01-01

    The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors and CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under the Cray Research adaptive Fortran (CRAFT) model, four programming methods (data parallel, work sharing, message passing using PVM, and the explicit shared memory model) are available to users. However, at this time the data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that neither standard PVM nor CRI's PVM exploits the hardware capabilities of the T3D. The reasons for the poor performance of PVM as a native message-passing library are presented and illustrated by the performance of the NAS Parallel Benchmarks (NPB) programmed in the explicit shared memory model on the Cray T3D. In general, the performance of standard PVM is about 4 to 5 times lower than that obtained using the explicit shared memory model. A similar degradation is seen on the CM-5, where the performance of applications using the native message-passing library CMMD is about 4 to 5 times lower than that of data parallel methods. The issues involved in programming in the explicit shared memory model (such as barriers, synchronization, invalidating the data cache, aligning the data cache, etc.) are discussed. Comparative performance of the NPB using the explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, and IBM-SP1 is presented.
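
    The explicit shared memory model benchmarked here survives in today's Cray SHMEM and OpenSHMEM libraries. As a minimal modern sketch (standard OpenSHMEM API, not the original T3D interface), a one-sided put lets one processing element write directly into another's memory with no matching receive, which is the property that made this model so much faster than PVM message passing:

      #include <shmem.h>
      #include <stdio.h>

      /* One-sided communication in the explicit shared memory style:
       * PE 0 stores directly into PE 1's copy of a symmetric variable;
       * the target issues no receive call.  Run with at least 2 PEs. */
      int main(void)
      {
          static long value = 0;            /* symmetric: allocated on every PE */

          shmem_init();
          if (shmem_my_pe() == 0)
              shmem_long_p(&value, 42, 1);  /* remote store into PE 1's memory */

          shmem_barrier_all();              /* complete the put before reading */

          if (shmem_my_pe() == 1)
              printf("PE 1 sees value = %ld\n", value);

          shmem_finalize();
          return 0;
      }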

  20. Fast and Scalable Computation of the Forward and Inverse Discrete Periodic Radon Transform.

    PubMed

    Carranza, Cesar; Llamocca, Daniel; Pattichis, Marios

    2016-01-01

    The discrete periodic Radon transform (DPRT) has been used extensively in applications that involve image reconstruction from projections. Beyond classic applications, the DPRT can also be used to compute fast convolutions that avoid the floating-point arithmetic associated with the use of the fast Fourier transform. Unfortunately, the use of the DPRT has been limited by the need to compute a large number of additions and by the need for a large number of memory accesses. This paper introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on the use of: a parallel array of fixed-point adder trees; circular shift registers to remove the need for accessing external memory components when selecting the input data for the adder trees; an image block-based approach to DPRT computation that can fit the proposed architecture to available resources; and fast transpositions that are computed in one or a few clock cycles and do not depend on the size of the input image. As a result, for an N × N image (N prime), the proposed approach can compute up to N² additions per clock cycle. Compared with previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources. For example, for a 251 × 251 image, using approximately 25% fewer flip-flops than a systolic implementation requires, the scalable DPRT is computed 36 times faster. For the fastest case, we introduce optimized architectures that can compute the DPRT and its inverse in just 2N + ⌈log₂ N⌉ + 1 and 2N + 3⌈log₂ N⌉ + B + 2 clock cycles, respectively, where B is the number of bits used to represent each input pixel. On the other hand, the scalable DPRT approach requires more 1-b additions than the systolic implementation and provides a tradeoff between speed and additional 1-b additions. All of the proposed DPRT architectures were implemented in VHSIC Hardware Description Language (VHDL) and validated using a Field-Programmable Gate Array (FPGA) implementation.
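
    For reference, a direct software implementation of the forward DPRT under one common convention is only a few loops; this O(N³) reference computation is what the paper's adder-tree hardware evaluates in O(N) clock cycles (the convention below is a textbook form, not necessarily the exact indexing used in the paper):

      /* Direct-sum forward DPRT under one common convention:
       *   R[m][d] = sum_{j=0}^{N-1} f[(d + m*j) mod N][j],  m = 0..N-1
       *   R[N][d] = sum_{i=0}^{N-1} f[i][d]                 (column sums)
       * N must be prime; the output is N+1 projections of N bins each. */
      void dprt_forward(int N, const int f[N][N], int R[N + 1][N])
      {
          for (int m = 0; m < N; m++)
              for (int d = 0; d < N; d++) {
                  int s = 0;
                  for (int j = 0; j < N; j++)
                      s += f[(d + m * j) % N][j];
                  R[m][d] = s;
              }
          for (int d = 0; d < N; d++) {     /* the extra projection, m = N */
              int s = 0;
              for (int i = 0; i < N; i++)
                  s += f[i][d];
              R[N][d] = s;
          }
      }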

  1. Shared Memory Parallelization of an Implicit ADI-type CFD Code

    NASA Technical Reports Server (NTRS)

    Hauser, Th.; Huang, P. G.

    1999-01-01

    A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of Re_τ = 180 has shown good agreement with existing data.
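
    A minimal sketch of the approach, assuming a line-implicit (Thomas algorithm) sweep over a 2-D grid: within one sweep direction the grid lines are independent, so OpenMP distributes them across threads, and indexing is arranged so each thread's inner loop walks contiguous memory, the cache behavior the paper identifies as controlling performance. The function and array names are illustrative, not taken from LESTool.

      #include <omp.h>

      /* One ADI sweep direction: each grid line i has an independent
       * tridiagonal system, solved with the Thomas algorithm, so the
       * lines are distributed across threads.  Indexing u[i*n + j]
       * keeps every thread's inner loop on contiguous memory.
       * Sketch assumptions: one coefficient set (a, b, c) for all
       * lines, no pivoting, and n <= 4096 for the scratch array. */
      void x_sweep(int n, double *u, const double *a, const double *b,
                   const double *c, const double *rhs)
      {
      #pragma omp parallel for schedule(static)
          for (int i = 0; i < n; i++) {
              double w[4096];                       /* per-thread scratch */
              double bet = b[0];
              u[i * n] = rhs[i * n] / bet;
              for (int j = 1; j < n; j++) {         /* forward elimination */
                  w[j] = c[j - 1] / bet;
                  bet = b[j] - a[j] * w[j];
                  u[i * n + j] = (rhs[i * n + j] - a[j] * u[i * n + j - 1]) / bet;
              }
              for (int j = n - 2; j >= 0; j--)      /* back substitution */
                  u[i * n + j] -= w[j + 1] * u[i * n + j + 1];
          }
      }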

  2. Address tracing for parallel machines

    NASA Technical Reports Server (NTRS)

    Stunkel, Craig B.; Janssens, Bob; Fuchs, W. Kent

    1991-01-01

    Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately.

  3. Shared processing in multiple object tracking and visual working memory in the absence of response order and task order confounds

    PubMed Central

    Howe, Piers D. L.

    2017-01-01

    To understand how the visual system represents multiple moving objects and how those representations contribute to tracking, it is essential that we understand how the processes of attention and working memory interact. In the work described here we present an investigation of that interaction via a series of tracking and working memory dual-task experiments. Previously, it has been argued that tracking is resistant to disruption by a concurrent working memory task and that any apparent disruption is in fact due to observers making a response to the working memory task, rather than due to competition for shared resources. Contrary to this, in our experiments we find that when task order and response order confounds are avoided, all participants show a similar decrease in both tracking and working memory performance. However, if task and response order confounds are not adequately controlled for we find substantial individual differences, which could explain the previous conflicting reports on this topic. Our results provide clear evidence that tracking and working memory tasks share processing resources. PMID:28410383

  4. Shared processing in multiple object tracking and visual working memory in the absence of response order and task order confounds.

    PubMed

    Lapierre, Mark D; Cropper, Simon J; Howe, Piers D L

    2017-01-01

    To understand how the visual system represents multiple moving objects and how those representations contribute to tracking, it is essential that we understand how the processes of attention and working memory interact. In the work described here we present an investigation of that interaction via a series of tracking and working memory dual-task experiments. Previously, it has been argued that tracking is resistant to disruption by a concurrent working memory task and that any apparent disruption is in fact due to observers making a response to the working memory task, rather than due to competition for shared resources. Contrary to this, in our experiments we find that when task order and response order confounds are avoided, all participants show a similar decrease in both tracking and working memory performance. However, if task and response order confounds are not adequately controlled for we find substantial individual differences, which could explain the previous conflicting reports on this topic. Our results provide clear evidence that tracking and working memory tasks share processing resources.

  5. Visual and spatial working memory are not that dissociated after all: a time-based resource-sharing account.

    PubMed

    Vergauwe, Evie; Barrouillet, Pierre; Camos, Valérie

    2009-07-01

    Examinations of interference between visual and spatial materials in working memory have suggested domain- and process-based fractionations of visuo-spatial working memory. The present study examined the role of central time-based resource sharing in visuo-spatial working memory and assessed its role in obtained interference patterns. Visual and spatial storage were combined with both visual and spatial on-line processing components in computer-paced working memory span tasks (Experiment 1) and in a selective interference paradigm (Experiment 2). The cognitive load of the processing components was manipulated to investigate its impact on concurrent maintenance for both within-domain and between-domain combinations of processing and storage components. In contrast to both domain- and process-based fractionations of visuo-spatial working memory, the results revealed that recall performance was determined by the cognitive load induced by the processing of items, rather than by the domain to which those items pertained. These findings are interpreted as evidence for a time-based resource-sharing mechanism in visuo-spatial working memory.

  6. Makalu: fast recoverable allocation of non-volatile memory

    DOE PAGES

    Bhandari, Kumud; Chakrabarti, Dhruva R.; Boehm, Hans-J.

    2016-10-19

    Byte addressable non-volatile memory (NVRAM) is likely to supplement, and perhaps eventually replace, DRAM. Applications can then persist data structures directly in memory instead of serializing them and storing them onto a durable block device. However, failures during execution can leave data structures in NVRAM unreachable or corrupt. In this paper, we present Makalu, a system that addresses non-volatile memory management. Makalu offers an integrated allocator and recovery-time garbage collector that maintains internal consistency, avoids NVRAM memory leaks, and is efficient, all in the face of failures. We show that a careful allocator design can support a less restrictive and a much more familiar programming model than existing persistent memory allocators. Our allocator significantly reduces the per allocation persistence overhead by lazily persisting non-essential metadata and by employing a post-failure recovery-time garbage collector. Experimental results show that the resulting online speed and scalability of our allocator are comparable to well-known transient allocators, and significantly better than state-of-the-art persistent allocators.
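
    Makalu's own API is not reproduced here; the sketch below only illustrates the persist-ordering discipline that such allocators must follow, with POSIX mmap/msync standing in for the cache-line flush and fence instructions a real NVRAM allocator would issue, and with the file name and two-word layout invented for the example: the payload must be durable before the metadata that makes it reachable.

      #include <fcntl.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Persist-ordering sketch: make the payload durable before the
       * flag that makes it reachable after a crash.  msync on a
       * file-backed mapping stands in for NVRAM flush instructions. */
      int bump_persistent_counter(const char *path)
      {
          int fd = open(path, O_RDWR | O_CREAT, 0644);
          if (fd < 0)
              return -1;
          if (ftruncate(fd, 4096) < 0) {
              close(fd);
              return -1;
          }

          long *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
              close(fd);
              return -1;
          }

          p[1] += 1;                      /* update the payload ...        */
          msync(p, 4096, MS_SYNC);        /* ... and persist it first      */
          p[0] = 1;                       /* then set the "valid" flag ... */
          msync(p, 4096, MS_SYNC);        /* ... and persist the flag      */

          munmap(p, 4096);
          close(fd);
          return 0;
      }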

  7. Makalu: fast recoverable allocation of non-volatile memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bhandari, Kumud; Chakrabarti, Dhruva R.; Boehm, Hans-J.

    Byte addressable non-volatile memory (NVRAM) is likely to supplement, and perhaps eventually replace, DRAM. Applications can then persist data structures directly in memory instead of serializing them and storing them onto a durable block device. However, failures during execution can leave data structures in NVRAM unreachable or corrupt. In this paper, we present Makalu, a system that addresses non-volatile memory management. Makalu offers an integrated allocator and recovery-time garbage collector that maintains internal consistency, avoids NVRAM memory leaks, and is efficient, all in the face of failures. We show that a careful allocator design can support a less restrictive and a much more familiar programming model than existing persistent memory allocators. Our allocator significantly reduces the per allocation persistence overhead by lazily persisting non-essential metadata and by employing a post-failure recovery-time garbage collector. Experimental results show that the resulting online speed and scalability of our allocator are comparable to well-known transient allocators, and significantly better than state-of-the-art persistent allocators.

  8. MDGRAPE-4: a special-purpose computer system for molecular dynamics simulations.

    PubMed

    Ohmura, Itta; Morimoto, Gentaro; Ohno, Yousuke; Hasegawa, Aki; Taiji, Makoto

    2014-08-06

    We are developing the MDGRAPE-4, a special-purpose computer system for molecular dynamics (MD) simulations. MDGRAPE-4 is designed to achieve strong scalability for protein MD simulations through the integration of general-purpose cores, dedicated pipelines, memory banks and network interfaces (NIFs) to create a system on chip (SoC). Each SoC has 64 dedicated pipelines that are used for non-bonded force calculations and run at 0.8 GHz. Additionally, it has 65 Tensilica Xtensa LX cores with single-precision floating-point units that are used for other calculations and run at 0.6 GHz. At peak performance levels, each SoC can evaluate 51.2 G interactions per second. It also has 1.8 MB of embedded shared memory banks and six network units with a peak bandwidth of 7.2 GB s⁻¹ for the three-dimensional torus network. The system consists of 512 (8×8×8) SoCs in total, which are mounted on 64 node modules with eight SoCs. The optical transmitters/receivers are used for internode communication. The expected maximum power consumption is 50 kW. While the MDGRAPE-4 software is still being improved, we plan to run MD simulations on MDGRAPE-4 in 2014. The MDGRAPE-4 system will enable long-time molecular dynamics simulations of small systems. It is also useful for multiscale molecular simulations where the particle simulation parts often become bottlenecks.

  9. MDGRAPE-4: a special-purpose computer system for molecular dynamics simulations

    PubMed Central

    Ohmura, Itta; Morimoto, Gentaro; Ohno, Yousuke; Hasegawa, Aki; Taiji, Makoto

    2014-01-01

    We are developing the MDGRAPE-4, a special-purpose computer system for molecular dynamics (MD) simulations. MDGRAPE-4 is designed to achieve strong scalability for protein MD simulations through the integration of general-purpose cores, dedicated pipelines, memory banks and network interfaces (NIFs) to create a system on chip (SoC). Each SoC has 64 dedicated pipelines that are used for non-bonded force calculations and run at 0.8 GHz. Additionally, it has 65 Tensilica Xtensa LX cores with single-precision floating-point units that are used for other calculations and run at 0.6 GHz. At peak performance levels, each SoC can evaluate 51.2 G interactions per second. It also has 1.8 MB of embedded shared memory banks and six network units with a peak bandwidth of 7.2 GB s−1 for the three-dimensional torus network. The system consists of 512 (8×8×8) SoCs in total, which are mounted on 64 node modules with eight SoCs. The optical transmitters/receivers are used for internode communication. The expected maximum power consumption is 50 kW. While the MDGRAPE-4 software is still being improved, we plan to run MD simulations on MDGRAPE-4 in 2014. The MDGRAPE-4 system will enable long-time molecular dynamics simulations of small systems. It is also useful for multiscale molecular simulations where the particle simulation parts often become bottlenecks. PMID:24982255

  10. Why are you telling me that? A conceptual model of the social function of autobiographical memory.

    PubMed

    Alea, Nicole; Bluck, Susan

    2003-03-01

    In an effort to stimulate and guide empirical work within a functional framework, this paper provides a conceptual model of the social functions of autobiographical memory (AM) across the lifespan. The model delineates the processes and variables involved when AMs are shared to serve social functions. Components of the model include: lifespan contextual influences, the qualitative characteristics of memory (emotionality and level of detail recalled), the speaker's characteristics (age, gender, and personality), the familiarity and similarity of the listener to the speaker, the level of responsiveness during the memory-sharing process, and the nature of the social relationship in which the memory sharing occurs (valence and length of the relationship). These components are shown to influence the type of social function served and/or the extent to which social functions are served. Directions for future empirical work to substantiate the model and hypotheses derived from the model are provided.

  11. Scalable splitting algorithms for big-data interferometric imaging in the SKA era

    NASA Astrophysics Data System (ADS)

    Onose, Alexandru; Carrillo, Rafael E.; Repetti, Audrey; McEwen, Jason D.; Thiran, Jean-Philippe; Pesquet, Jean-Christophe; Wiaux, Yves

    2016-11-01

    In the context of next-generation radio telescopes, like the Square Kilometre Array (SKA), the efficient processing of large-scale data sets is extremely important. Convex optimization tasks under the compressive sensing framework have recently emerged and provide both enhanced image reconstruction quality and scalability to increasingly larger data sets. We focus herein mainly on scalability and propose two new convex optimization algorithmic structures able to solve the convex optimization tasks arising in radio-interferometric imaging. They rely on proximal splitting and forward-backward iterations and can be seen, by analogy with the CLEAN major-minor cycle, as running sophisticated CLEAN-like iterations in parallel in multiple data, prior, and image spaces. Both methods support any convex regularization function, in particular, the well-studied ℓ1 priors promoting image sparsity in an adequate domain. Tailored for big data, they employ parallel and distributed computations to achieve scalability, in terms of memory and computational requirements. One of them also exploits randomization, over data blocks at each iteration, offering further flexibility. We present simulation results showing the feasibility of the proposed methods as well as their advantages compared to state-of-the-art algorithmic solvers. Our MATLAB code is available online on GitHub.
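
    Both algorithmic structures are built from the standard forward-backward step. In textbook form (the generic iteration, not the paper's exact parallel update), minimizing f(x) + g(x) with f differentiable and g handled through its proximity operator:

      % forward (gradient) step on the differentiable data-fidelity term f,
      % backward (proximal) step on the non-smooth prior g (e.g. an l1 norm):
      x^{(k+1)} = \operatorname{prox}_{\gamma g}\!\left( x^{(k)} - \gamma \nabla f\bigl(x^{(k)}\bigr) \right),
      \qquad 0 < \gamma < 2/L,

    where L is a Lipschitz constant of ∇f. The paper's contribution lies in organizing such steps so the data-fidelity and prior terms split across parallel processes, not in the basic iteration itself.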

  12. Destination memory impairment in older people.

    PubMed

    Gopie, Nigel; Craik, Fergus I M; Hasher, Lynn

    2010-12-01

    Older adults are assumed to have poor destination memory-knowing to whom they tell particular information-and anecdotes about them repeating stories to the same people are cited as informal evidence for this claim. Experiment 1 assessed young and older adults' destination memory by having participants tell facts (e.g., "A dime has 118 ridges around its edge") to pictures of famous people (e.g., Oprah Winfrey). Surprise recognition memory tests, which also assessed confidence, revealed that older adults, compared to young adults, were disproportionately impaired on destination memory relative to spared memory for the individual components (i.e., facts, faces) of the episode. Older adults also were more confident that they had not told a fact to a particular person when they actually had (i.e., a miss); this presumably causes them to repeat information more often than young adults. When the direction of information transfer was reversed in Experiment 2, such that the famous people shared information with the participants (i.e., a source memory experiment), age-related memory differences disappeared. In contrast to the destination memory experiment, older adults in the source memory experiment were more confident than young adults that someone had shared a fact with them when a different person actually had shared the fact (i.e., a false alarm). Overall, accuracy and confidence jointly influence age-related changes to destination memory, a fundamental component of successful communication. (c) 2010 APA, all rights reserved.

  13. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Duro, Francisco Rodrigo; Blas, Javier Garcia; Isaila, Florin

    The increasing volume of scientific data and the limited scalability and performance of storage systems are currently presenting a significant limitation for the productivity of the scientific workflows running on both high-performance computing (HPC) and cloud platforms. Clearly needed is better integration of storage systems and workflow engines to address this problem. This paper presents and evaluates a novel solution that leverages codesign principles for integrating Hercules—an in-memory data store—with a workflow management system. We consider four main aspects: workflow representation, task scheduling, task placement, and task termination. As a result, the experimental evaluation on both cloud and HPC systems demonstrates significant performance and scalability improvements over existing state-of-the-art approaches.

  14. PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences

    PubMed Central

    Mirarab, Siavash; Nguyen, Nam; Guo, Sheng; Wang, Li-San; Kim, Junhyong

    2015-01-01

    We introduce PASTA, a new multiple sequence alignment algorithm. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé). We also show that trees estimated on PASTA alignments are highly accurate—slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is faster than SATé, highly parallelizable, and requires relatively little memory. PMID:25549288

  15. A comprehensive study of MPI parallelism in three-dimensional discrete element method (DEM) simulation of complex-shaped granular particles

    NASA Astrophysics Data System (ADS)

    Yan, Beichuan; Regueiro, Richard A.

    2018-02-01

    A three-dimensional (3D) DEM code for simulating complex-shaped granular particles is parallelized using message-passing interface (MPI). The concepts of link-block, ghost/border layer, and migration layer are put forward for the design of the parallel algorithm, and theoretical functions for 3-D DEM scalability and memory usage are derived. Many performance-critical implementation details are managed optimally to achieve high performance and scalability, such as: minimizing communication overhead, maintaining dynamic load balance, handling particle migrations across block borders, transmitting C++ dynamic objects of particles between MPI processes efficiently, and eliminating redundant contact information between adjacent MPI processes. The code executes on multiple US Department of Defense (DoD) supercomputers and has been tested on up to 2048 compute nodes for simulating 10 million three-axis ellipsoidal particles. Performance analyses of the code, including speedup, efficiency, scalability, and granularity across five orders of magnitude of simulation scale (number of particles), are provided, and they demonstrate high speedup and excellent scalability. It is also discovered that communication time is a decreasing function of the number of compute nodes in strong scaling measurements. The code's capability of simulating a large number of complex-shaped particles on modern supercomputers will be of value in both laboratory studies on micromechanical properties of granular materials and many realistic engineering applications involving granular materials.
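
    The border/ghost-layer exchange at the heart of such a code reduces to paired sends and receives between neighboring subdomains. A minimal sketch, assuming a 1-D decomposition and fixed-size particle records for brevity (the paper uses a 3-D block decomposition and transmits dynamic C++ particle objects):

      #include <mpi.h>

      /* Border/ghost-layer exchange for a 1-D chain of subdomains:
       * each rank sends its border particles to both neighbors and
       * receives their borders as its ghost layers.  MPI_Sendrecv
       * pairs each send with a receive so neighbors cannot deadlock;
       * MPI_Get_count on the status would give the count received. */
      typedef struct { double x, y, z, vx, vy, vz; } particle;

      void exchange_ghosts(particle *send_lo, int n_lo,
                           particle *send_hi, int n_hi,
                           particle *ghost_lo, particle *ghost_hi,
                           int max_ghost, MPI_Comm comm)
      {
          int rank, size;
          MPI_Comm_rank(comm, &rank);
          MPI_Comm_size(comm, &size);

          int lo = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
          int hi = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
          int psz = (int)sizeof(particle);
          MPI_Status st;

          /* low border goes down; high ghost layer arrives from above */
          MPI_Sendrecv(send_lo, n_lo * psz, MPI_BYTE, lo, 0,
                       ghost_hi, max_ghost * psz, MPI_BYTE, hi, 0, comm, &st);
          /* high border goes up; low ghost layer arrives from below */
          MPI_Sendrecv(send_hi, n_hi * psz, MPI_BYTE, hi, 1,
                       ghost_lo, max_ghost * psz, MPI_BYTE, lo, 1, comm, &st);
      }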

  16. Fabrication and electrical characterization of a MOS memory device containing self-assembled metallic nanoparticles

    NASA Astrophysics Data System (ADS)

    Sargentis, Ch.; Giannakopoulos, K.; Travlos, A.; Tsamakis, D.

    2007-04-01

    Floating gate devices with nanoparticles embedded in dielectrics have recently attracted much attention due to the fact that these devices operate as non-volatile memories with high speed, high density and low power consumption. In this paper, memory devices containing gold (Au) nanoparticles have been fabricated using e-gun evaporation. The Au nanoparticles are deposited on a very thin SiO 2 layer and are then fully covered by a HfO 2 layer. The HfO 2 is a high- k dielectric and gives good scalability to the fabricated devices. We studied the effect of the deposition parameters to the size and the shape of the Au nanoparticles using capacitance-voltage and conductance-voltage measurements, we demonstrated that the fabricated device can indeed operate as a low-voltage memory device.

  17. Interactive Volume Exploration of Petascale Microscopy Data Streams Using a Visualization-Driven Virtual Memory Approach.

    PubMed

    Hadwiger, M; Beyer, J; Jeong, Won-Ki; Pfister, H

    2012-12-01

    This paper presents the first volume visualization system that scales to petascale volumes imaged as a continuous stream of high-resolution electron microscopy images. Our architecture scales to dense, anisotropic petascale volumes because it: (1) decouples construction of the 3D multi-resolution representation required for visualization from data acquisition, and (2) decouples sample access time during ray-casting from the size of the multi-resolution hierarchy. Our system is designed around a scalable multi-resolution virtual memory architecture that handles missing data naturally, does not pre-compute any 3D multi-resolution representation such as an octree, and can accept a constant stream of 2D image tiles from the microscopes. A novelty of our system design is that it is visualization-driven: we restrict most computations to the visible volume data. Leveraging the virtual memory architecture, missing data are detected during volume ray-casting as cache misses, which are propagated backwards for on-demand out-of-core processing. 3D blocks of volume data are only constructed from 2D microscope image tiles when they have actually been accessed during ray-casting. We extensively evaluate our system design choices with respect to scalability and performance, compare to previous best-of-breed systems, and illustrate the effectiveness of our system for real microscopy data from neuroscience.

  18. Twin-bit via resistive random access memory in 16 nm FinFET logic technologies

    NASA Astrophysics Data System (ADS)

    Shih, Yi-Hong; Hsu, Meng-Yin; King, Ya-Chin; Lin, Chrong Jung

    2018-04-01

    A via resistive random access memory (RRAM) cell fully compatible with the standard CMOS logic process has been successfully demonstrated for high-density logic nonvolatile memory (NVM) modules in advanced FinFET circuits. In this new cell, the transition metal layers are formed on both sides of a via, giving two storage bits per via. In addition to its compact cell area (1T + 14 nm × 32 nm), the twin-bit via RRAM cell features a low operation voltage, a large read window, good data retention, and excellent cycling capability. As fine alignments between mask layers become possible, the twin-bit via RRAM cell is expected to be highly scalable in advanced FinFET technology.

  19. A Collaborative Web-Based Architecture For Sharing ToxCast Data

    EPA Science Inventory

    Collaborative Drug Discovery (CDD) has created a scalable platform that combines traditional drug discovery informatics with Web2.0 features. Traditional drug discovery capabilities include substructure, similarity searching and export to excel or sdf formats. Web2.0 features inc...

  20. Interference due to shared features between action plans is influenced by working memory span.

    PubMed

    Fournier, Lisa R; Behmer, Lawrence P; Stubblefield, Alexandra M

    2014-12-01

    In this study, we examined the interactions between the action plans that we hold in memory and the actions that we carry out, asking whether the interference due to shared features between action plans is due to selection demands imposed on working memory. Individuals with low and high working memory spans learned arbitrary motor actions in response to two different visual events (A and B), presented in a serial order. They planned a response to the first event (A) and while maintaining this action plan in memory they then executed a speeded response to the second event (B). Afterward, they executed the action plan for the first event (A) maintained in memory. Speeded responses to the second event (B) were delayed when it shared an action feature (feature overlap) with the first event (A), relative to when it did not (no feature overlap). The size of the feature-overlap delay was greater for low-span than for high-span participants. This indicates that interference due to overlapping action plans is greater when fewer working memory resources are available, suggesting that this interference is due to selection demands imposed on working memory. Thus, working memory plays an important role in managing current and upcoming action plans, at least for newly learned tasks. Also, managing multiple action plans is compromised in individuals who have low versus high working memory spans.

  1. Destination Memory Impairment in Older People

    PubMed Central

    Gopie, Nigel; Craik, Fergus I. M.; Hasher, Lynn

    2012-01-01

    Older adults are assumed to have poor destination memory— knowing to whom they tell particular information—and anecdotes about them repeating stories to the same people are cited as informal evidence for this claim. Experiment 1 assessed young and older adults’ destination memory by having participants tell facts (e.g., “A dime has 118 ridges around its edge”) to pictures of famous people (e.g., Oprah Winfrey). Surprise recognition memory tests, which also assessed confidence, revealed that older adults, compared to young adults, were disproportionately impaired on destination memory relative to spared memory for the individual components (i.e., facts, faces) of the episode. Older adults also were more confident that they had not told a fact to a particular person when they actually had (i.e., a miss); this presumably causes them to repeat information more often than young adults. When the direction of information transfer was reversed in Experiment 2, such that the famous people shared information with the participants (i.e., a source memory experiment), age-related memory differences disappeared. In contrast to the destination memory experiment, older adults in the source memory experiment were more confident than young adults that someone had shared a fact with them when a different person actually had shared the fact (i.e., a false alarm). Overall, accuracy and confidence jointly influence age-related changes to destination memory, a fundamental component of successful communication. PMID:20718537

  2. Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction

    PubMed Central

    2014-01-01

    Background: Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and requires resources not satisfiable with simple off-the-shelf computers. Results: We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins. Conclusions: The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines. PMID:24843788

  3. A Distributed Platform for Global-Scale Agent-Based Models of Disease Transmission

    PubMed Central

    Parker, Jon; Epstein, Joshua M.

    2013-01-01

    The Global-Scale Agent Model (GSAM) is presented. The GSAM is a high-performance distributed platform for agent-based epidemic modeling capable of simulating a disease outbreak in a population of several billion agents. It is unprecedented in its scale, its speed, and its use of Java. Solutions to multiple challenges inherent in distributing massive agent-based models are presented. Communication, synchronization, and memory usage are among the topics covered in detail. The memory usage discussion is Java specific. However, the communication and synchronization discussions apply broadly. We provide benchmarks illustrating the GSAM’s speed and scalability. PMID:24465120

  4. Low latency memory access and synchronization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs the subsequent write operation rather than the processor. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous but repetitive.
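
    In software, the patent's protocol looks like an ordinary spin lock whose atomicity lives in the locking device rather than in the processor. The sketch below is hypothetical (the device address and grant encoding are invented for illustration): acquiring is one load, and the device itself performs the ownership write.

      #include <stdint.h>

      /* Hypothetical software view of the patent's locking device: the
       * lock lives at a device address, a single load both tests and
       * claims it (the device performs the ownership write), and an
       * ordinary store releases it. */
      #define LOCK_GRANTED 1u

      static inline int try_lock(volatile uint32_t *lock_reg)
      {
          return *lock_reg == LOCK_GRANTED;  /* one load, no atomic RMW */
      }

      static inline void acquire_lock(volatile uint32_t *lock_reg)
      {
          while (!try_lock(lock_reg))
              ;                         /* spin until the device grants it */
      }

      static inline void release_lock(volatile uint32_t *lock_reg)
      {
          *lock_reg = 0;                /* store returns the lock */
      }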

  5. Low latency memory access and synchronization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs the subsequent write operation rather than the processor. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous but repetitive.

  6. Location-Unbound Color-Shape Binding Representations in Visual Working Memory.

    PubMed

    Saiki, Jun

    2016-02-01

    The mechanism by which nonspatial features, such as color and shape, are bound in visual working memory, and the role of those features' location in their binding, remains unknown. In the current study, I modified a redundancy-gain paradigm to investigate these issues. A set of features was presented in a two-object memory display, followed by a single object probe. Participants judged whether the probe contained any features of the memory display, regardless of its location. Response time distributions revealed feature coactivation only when both features of a single object in the memory display appeared together in the probe, regardless of the response time benefit from the probe and memory objects sharing the same location. This finding suggests that a shared location is necessary in the formation of bound representations but unnecessary in their maintenance. Electroencephalography data showed that amplitude modulations reflecting location-unbound feature coactivation were different from those reflecting the location-sharing benefit, consistent with the behavioral finding that feature-location binding is unnecessary in the maintenance of color-shape binding. © The Author(s) 2015.

  7. Shared reality in intergroup communication: Increasing the epistemic authority of an out-group audience.

    PubMed

    Echterhoff, Gerald; Kopietz, René; Higgins, E Tory

    2017-06-01

    Communicators typically tune messages to their audience's attitude. Such audience tuning biases communicators' memory for the topic toward the audience's attitude to the extent that they create a shared reality with the audience. To investigate shared reality in intergroup communication, we first established that a reduced memory bias after tuning messages to an out-group (vs. in-group) audience is a subtle index of communicators' denial of shared reality to that out-group audience (Experiments 1a and 1b). We then examined whether the audience-tuning memory bias might emerge when the out-group audience's epistemic authority is enhanced, either by increasing epistemic expertise concerning the communication topic or by creating epistemic consensus among members of a multiperson out-group audience. In Experiment 2, when Germans communicated to a Turkish audience with an attitude about a Turkish (vs. German) target, the audience-tuning memory bias appeared. In Experiment 3, when the audience of German communicators consisted of 3 Turks who all held the same attitude toward the target, the memory bias again appeared. The association between message valence and memory valence was consistently higher when the audience's epistemic authority was high (vs. low). An integrative analysis across all studies also suggested that the memory bias increases with increasing strength of epistemic inputs (epistemic expertise, epistemic consensus, and audience-tuned message production). The findings suggest novel ways of overcoming intergroup biases in intergroup relations. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  8. The Developmental Influence of Primary Memory Capacity on Working Memory and Academic Achievement

    PubMed Central

    2015-01-01

    In this study, we investigate the development of primary memory capacity among children. Children between the ages of 5 and 8 completed 3 novel tasks (split span, interleaved lists, and a modified free-recall task) that measured primary memory by estimating the number of items in the focus of attention that could be spontaneously recalled in serial order. These tasks were calibrated against traditional measures of simple and complex span. Clear age-related changes in these primary memory estimates were observed. There were marked individual differences in primary memory capacity, but each novel measure was predictive of simple span performance. Among older children, each measure shared variance with reading and mathematics performance, whereas for younger children, the interleaved lists task was the strongest single predictor of academic ability. We argue that these novel tasks have considerable potential for the measurement of primary memory capacity and provide new, complementary ways of measuring the transient memory processes that predict academic performance. The interleaved lists task also shared features with interference control tasks, and our findings suggest that young children have a particular difficulty in resisting distraction and that variance in the ability to resist distraction is also shared with measures of educational attainment. PMID:26075630

  9. The developmental influence of primary memory capacity on working memory and academic achievement.

    PubMed

    Hall, Debbora; Jarrold, Christopher; Towse, John N; Zarandi, Amy L

    2015-08-01

    In this study, we investigate the development of primary memory capacity among children. Children between the ages of 5 and 8 completed 3 novel tasks (split span, interleaved lists, and a modified free-recall task) that measured primary memory by estimating the number of items in the focus of attention that could be spontaneously recalled in serial order. These tasks were calibrated against traditional measures of simple and complex span. Clear age-related changes in these primary memory estimates were observed. There were marked individual differences in primary memory capacity, but each novel measure was predictive of simple span performance. Among older children, each measure shared variance with reading and mathematics performance, whereas for younger children, the interleaved lists task was the strongest single predictor of academic ability. We argue that these novel tasks have considerable potential for the measurement of primary memory capacity and provide new, complementary ways of measuring the transient memory processes that predict academic performance. The interleaved lists task also shared features with interference control tasks, and our findings suggest that young children have a particular difficulty in resisting distraction and that variance in the ability to resist distraction is also shared with measures of educational attainment. (c) 2015 APA, all rights reserved.

  10. Conditional load and store in a shared memory

    DOEpatents

    Blumrich, Matthias A; Ohmacht, Martin

    2015-02-03

    A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
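
    Load-reserve/store-conditional is the primitive pair known as LL/SC on other architectures. A portable analogue using C11 atomics, where compare-exchange stands in for the hardware reservation-register check described above:

      #include <stdatomic.h>

      /* Portable analogue of a load-reserve / store-conditional loop.
       * atomic_load plays the load-reserve; compare-exchange plays the
       * store-conditional, failing exactly when another processor has
       * touched the location, just as the reservation-register check
       * rejects a store after the reservation is lost. */
      long fetch_add_llsc_style(_Atomic long *addr, long delta)
      {
          long old = atomic_load(addr);                 /* "load-reserve" */
          while (!atomic_compare_exchange_weak(addr, &old, old + delta))
              ;        /* reservation lost: old now holds the fresh value */
          return old + delta;
      }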

  11. Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

    NASA Technical Reports Server (NTRS)

    Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta, J.; Gimenez, J.; Caubet, J.

    2003-01-01

    Parallel programming paradigms include process-level parallelism, thread-level parallelism, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.

  12. Measuring Transactiving Memory Systems Using Network Analysis

    ERIC Educational Resources Information Center

    King, Kylie Goodell

    2017-01-01

    Transactive memory systems (TMSs) describe the structures and processes that teams use to share information, work together, and accomplish shared goals. First introduced over three decades ago, TMSs have been measured in a variety of ways. This dissertation proposes the use of network analysis in measuring TMS. This is accomplished by describing…

  13. Operator Influence of Unexploded Ordnance Sensor Technologies

    DTIC Science & Technology

    2007-03-01

    Excerpt from the report's software component list: Mscomct2.dll – date/time display ActiveX control; Pnpscr.dll – Systran SCRAMNet replicated shared memory device...; rgm_p2.dll – Phase 2 shared memory API and implementation. Commercial components: StripM.ocx – strip chart display ActiveX control.

  14. Concurrent working memory load can facilitate selective attention: evidence for specialized load.

    PubMed

    Park, Soojin; Kim, Min-Shik; Chun, Marvin M

    2007-10-01

    Load theory predicts that concurrent working memory load impairs selective attention and increases distractor interference (N. Lavie, A. Hirst, J. W. de Fockert, & E. Viding). Here, the authors present new evidence that the type of concurrent working memory load determines whether load impairs selective attention or not. Working memory load was paired with a same/different matching task that required focusing on targets while ignoring distractors. When working memory items shared the same limited-capacity processing mechanisms with targets in the matching task, distractor interference increased. However, when working memory items shared processing with distractors in the matching task, distractor interference decreased, facilitating target selection. A specialized load account is proposed to describe the dissociable effects of working memory load on selective processing depending on whether the load overlaps with targets or with distractors. (c) 2007 APA

  15. Transactive memory systems scale for couples: development and validation

    PubMed Central

    Hewitt, Lauren Y.; Roberts, Lynne D.

    2015-01-01

    People in romantic relationships can develop shared memory systems by pooling their cognitive resources, allowing each person access to more information but with less cognitive effort. Research examining such memory systems in romantic couples largely focuses on remembering word lists or performing lab-based tasks, but these types of activities do not capture the processes underlying couples’ transactive memory systems, and may not be representative of the ways in which romantic couples use their shared memory systems in everyday life. We adapted an existing measure of transactive memory systems for use with romantic couples (TMSS-C), and conducted an initial validation study. In total, 397 participants who each identified as being a member of a romantic relationship of at least 3 months duration completed the study. The data provided a good fit to the anticipated three-factor structure of the components of couples’ transactive memory systems (specialization, credibility and coordination), and there was reasonable evidence of both convergent and divergent validity, as well as strong evidence of test–retest reliability across a 2-week period. The TMSS-C provides a valuable tool that can quickly and easily capture the underlying components of romantic couples’ transactive memory systems. It has potential to help us better understand this intriguing feature of romantic relationships, and how shared memory systems might be associated with other important features of romantic relationships. PMID:25999873

  16. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.

  17. The future of memory

    NASA Astrophysics Data System (ADS)

    Marinella, M.

    In the not too distant future, the traditional memory and storage hierarchy may be replaced by a single Storage Class Memory (SCM) device integrated on or near the logic processor. Traditional magnetic hard drives, NAND flash, DRAM, and higher level caches (L2 and up) will be replaced with a single high performance memory device. The Storage Class Memory paradigm will require high speed (< 100 ns read/write), excellent endurance (> 10¹² cycles), nonvolatility (retention > 10 years), and low switching energies (< 10 pJ per switch). The International Technology Roadmap for Semiconductors (ITRS) has recently evaluated several potential candidate SCM technologies, including Resistive (or Redox) RAM, Spin Torque Transfer RAM (STT-MRAM), and phase change memory (PCM). All of these devices show potential well beyond that of current flash technologies, and research efforts are underway to improve their endurance, write speeds, and scalability to be on par with DRAM. This progress has interesting implications for space electronics: each of these emerging device technologies shows excellent resistance to the types of radiation typically found in space applications. Commercially developed, high density storage class memory-based systems may include a memory that is physically radiation hard and suitable for space applications without major shielding efforts. This paper reviews the Storage Class Memory concept, emerging memory devices, and their possible applicability to radiation hardened electronics for space.

  18. Scalable cross-point resistive switching memory and mechanism through an understanding of H2O2/glucose sensing using an IrOx/Al2O3/W structure.

    PubMed

    Chakrabarti, Somsubhra; Maikap, Siddheswar; Samanta, Subhranu; Jana, Surajit; Roy, Anisha; Qiu, Jian-Tai

    2017-10-04

    The resistive switching characteristics of a scalable IrOx/Al2O3/W cross-point structure and its mechanism for pH/H2O2 sensing along with glucose detection have been investigated for the first time. Porous IrOx and Ir3+/Ir4+ oxidation states are observed via high-resolution transmission electron microscopy, field-emission scanning electron spectroscopy, and X-ray photo-electron spectroscopy. The 20 nm-thick IrOx devices in sidewall contact show consecutive long dc cycles at a low current compliance (CC) of 10 μA, multi-level operation with CC varying from 10 μA to 100 μA, and long program/erase endurance of >10⁹ cycles with 100 ns pulse width. IrOx with a thickness of 2 nm in the IrOx/Al2O3/SiO2/p-Si structure has shown super-Nernstian pH sensitivity of 115 mV per pH, and detection of H2O2 over the range of 1-100 nM is also achieved owing to the porous and reduction-oxidation (redox) characteristics of the IrOx membrane, whereas a pure Al2O3/SiO2 membrane does not show H2O2 sensing. A simulation based on Schottky, hopping, and Fowler-Nordheim tunneling conduction, and a redox reaction, is proposed. The experimental I-V curve matches very well with the simulation. The resistive switching mechanism is attributed to O2− ion migration, and the redox reaction of Ir3+/Ir4+ at the IrOx/Al2O3 interface through H2O2 sensing as well as Schottky barrier height modulation is responsible. Glucose at a low concentration of 10 pM is detected using a completely new process in the IrOx/Al2O3/W cross-point structure. Therefore, this cross-point memory offers a route to low-cost, scalable memory with low-current, multi-level operation, which will be useful for future highly dense three-dimensional (3D) memory and as a bio-sensor for the future diagnosis of human diseases.

  19. DMA shared byte counters in a parallel computer

    DOEpatents

    Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos

    2010-04-06

    A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory, and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status and control registers. The injection FIFO metadata maintains the memory locations of the injection FIFOs, including each FIFO's current head and tail, and the reception FIFO metadata maintains the memory locations of the reception FIFOs, including each FIFO's current head and tail. The injection byte counters and reception byte counters may be shared between messages.
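
    As a mental model, the per-node DMA state enumerated above can be pictured as plain structures. The counts and field names below are hypothetical, chosen only to mirror the patent's list of components:

      #include <stdint.h>

      /* Hypothetical layout mirroring the patent's component list;
       * FIFO and counter counts are invented for illustration. */
      typedef struct {
          uint64_t base;   /* memory location of the FIFO buffer  */
          uint64_t head;   /* next descriptor the engine consumes */
          uint64_t tail;   /* next slot software fills            */
      } fifo_metadata;

      typedef struct {
          fifo_metadata injection[8];    /* injection FIFO metadata      */
          fifo_metadata reception[8];    /* reception FIFO metadata      */
          uint64_t inj_byte_counter[4];  /* shared: several messages may */
          uint64_t rcv_byte_counter[4];  /*   decrement the same counter */
          uint64_t status;               /* status register              */
          uint64_t control;              /* control register             */
      } dma_engine;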

  20. Division of attention as a function of the number of steps, visual shifts, and memory load

    NASA Technical Reports Server (NTRS)

    Chechile, R. A.; Butler, K.; Gutowski, W.; Palmer, E. A.

    1986-01-01

    The effects on divided attention of visual shifts and long-term memory retrieval during a monitoring task are considered. A concurrent vigilance task was standardized under all experimental conditions. The results show that subjects can perform nearly perfectly on all of the time-shared tasks if long-term memory retrieval is not required for monitoring. With the requirement of memory retrieval, however, there was a large decrease in accuracy for all of the time-shared activities. It was concluded that the attentional demand of long-term memory retrieval is appreciable (even for a well-learned motor sequence), and thus memory retrieval results in a sizable reduction in the capability of subjects to divide their attention. A selected bibliography on the divided attention literature is provided.

  1. Fabrication of cross-shaped Cu-nanowire resistive memory devices using a rapid, scalable, and designable inorganic-nanowire-digital-alignment technique (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Xu, Wentao; Lee, Yeongjun; Min, Sung-Yong; Park, Cheolmin; Lee, Tae-Woo

    2016-09-01

    Resistive random-access memory (RRAM) is a candidate next-generation nonvolatile memory due to its high access speed, high density and ease of fabrication. In particular, cross-point access allows cross-bar arrays that lead to high-density cells in a two-dimensional planar structure. Use of such designs could be compatible with the aggressive scaling down of memory devices, but existing methods such as optical or e-beam lithographic approaches are too complicated. One-dimensional inorganic nanowires (i-NWs) are regarded as ideal components of nanoelectronics to circumvent the limitations of conventional lithographic approaches. However, post-growth alignment of these i-NWs precisely on a large area with individual control is still a difficult challenge. Here, we report a simple, inexpensive, and rapid method to fabricate two-dimensional arrays of perpendicularly-aligned, individually-conductive Cu-NWs with a nanometer-scale CuxO layer sandwiched at each cross point, by using an inorganic-nanowire-digital-alignment technique (INDAT) and a one-step reduction process. In this approach, the oxide layer is self-formed and patterned, so conventional deposition and lithography are not necessary. INDAT eliminates the difficulties of alignment and scalable fabrication that are encountered when using currently-available techniques that use inorganic nanowires. This simple process facilitates fabrication of cross-point nonvolatile memristor arrays. Fabricated arrays had reproducible resistive switching behavior, a high on/off current ratio (Ion/Ioff) of 10^6, and extensive cycling endurance. This is the first report of memristors with the resistive switching oxide layer self-formed, self-patterned and self-positioned; we envision that the new features of the technique will provide great opportunities for future nano-electronic circuits.

  2. Welcoming Nora: a family event.

    PubMed

    Walsh, Allison J; Walsh, Paul R; Walsh, Jane M; Walsh, Gavin T

    2011-01-01

    In this column, Allison and Paul Walsh share the story of the birth of Nora, their third baby and their second child to be born at home. Allison and Paul share their individual memories of labor and birth. But their story is only part of the story of Nora's birth. Nora's birth was a family event, with Allison and Paul's other children very much part of the experience. Jane and Gavin share their own memories of their baby sister's birth.

  3. A study of the switching mechanism and electrode material of fully CMOS compatible tungsten oxide ReRAM

    NASA Astrophysics Data System (ADS)

    Chien, W. C.; Chen, Y. C.; Lai, E. K.; Lee, F. M.; Lin, Y. Y.; Chuang, Alfred T. H.; Chang, K. P.; Yao, Y. D.; Chou, T. H.; Lin, H. M.; Lee, M. H.; Shih, Y. H.; Hsieh, K. Y.; Lu, Chih-Yuan

    2011-03-01

    Tungsten oxide (WOx) resistive memory (ReRAM), a two-terminal CMOS-compatible nonvolatile memory, has shown promise to surpass existing flash memory in terms of scalability, switching speed, and potential for 3D stacking. The memory layer, WOx, can be easily fabricated by down-stream plasma oxidation (DSPO) or rapid thermal oxidation (RTO) of the W plugs universally used in CMOS circuits. Results of conductive AFM (C-AFM) experiments suggest the switching mechanism is dominated by the redox (reduction-oxidation) reaction: the creation of conducting filaments leads to a low resistance state and the rupturing of the filaments results in a high resistance state. Our experimental results show that the reactions happen at the TE/WOx interface. With this understanding in mind, we proposed two approaches to boost the memory performance: (i) using DSPO to treat the RTO WOx surface and (ii) using a Pt TE, which forms a Schottky barrier with WOx. Both approaches, especially the latter, significantly reduce the forming current and enlarge the memory window.

  4. Recent trends in hardware security exploiting hybrid CMOS-resistive memory circuits

    NASA Astrophysics Data System (ADS)

    Sahay, Shubham; Suri, Manan

    2017-12-01

    This paper provides a comprehensive review of, and insight into, recent trends in the field of random number generator (RNG) and physically unclonable function (PUF) circuits implemented using different types of emerging resistive non-volatile memory (NVM) devices. We present a detailed review of hybrid RNG/PUF implementations based on the use of (i) Spin-Transfer Torque (STT-MRAM) and (ii) metal-oxide-based (OxRAM) NVM devices. Various approaches to hybrid CMOS-NVM RNG/PUF circuits are considered, followed by a discussion of different nanoscale device phenomena. Certain nanoscale device phenomena (variability, stochasticity, etc.), which are otherwise undesirable for reliable memory and storage applications, form the basis for low-power and highly scalable RNG/PUF circuits. A detailed qualitative comparison and benchmarking of all implementations is performed.

  5. Electrically-controlled nonlinear switching and multi-level storage characteristics in WOx film-based memory cells

    NASA Astrophysics Data System (ADS)

    Duan, W. J.; Wang, J. B.; Zhong, X. L.

    2018-05-01

    Resistive switching random access memory (RRAM) is considered a promising candidate for next-generation memory due to its scalability, high integration density and non-volatile storage characteristics. Here, the multiple electrical characteristics of Pt/WOx/Pt cells are investigated. Both nonlinear switching and multi-level storage can be achieved by setting different compliance currents in the same cell. The correlations among current, time and temperature are analyzed using contours and 3D surfaces. The switching mechanism is explained in terms of the formation and rupture of conductive filaments related to oxygen vacancies. The experimental results show that the non-stoichiometric WOx film-based device offers a feasible route to applications of oxide-based RRAMs.

  6. Logic computation in phase change materials by threshold and memory switching.

    PubMed

    Cassinerio, M; Ciocchini, N; Ielmini, D

    2013-11-06

    Memristors, namely hysteretic devices capable of changing their resistance in response to applied electrical stimuli, may provide new opportunities for future memory and computation, thanks to their scalable size, low switching energy and nonvolatile nature. We have developed a functionally complete set of logic functions including NOR, NAND and NOT gates, each utilizing a single phase-change memristor (PCM) where resistance switching is due to the phase transformation of an active chalcogenide material. The logic operations are enabled by the high functionality of nanoscale phase change, featuring voltage comparison, additive crystallization and pulse-induced amorphization. The nonvolatile nature of memristive states provides the basis for developing reconfigurable hybrid logic/memory circuits featuring low-power and high-speed switching.

  7. Spectral multiplexing for scalable quantum photonics using an atomic frequency comb quantum memory and feed-forward control.

    PubMed

    Sinclair, Neil; Saglamyurek, Erhan; Mallahzadeh, Hassan; Slater, Joshua A; George, Mathew; Ricken, Raimund; Hedges, Morgan P; Oblak, Daniel; Simon, Christoph; Sohler, Wolfgang; Tittel, Wolfgang

    2014-08-01

    Future multiphoton applications of quantum optics and quantum information science require quantum memories that simultaneously store many photon states, each encoded into a different optical mode, and enable one to select the mapping between any input and a specific retrieved mode during storage. Here we show, with the example of a quantum repeater, how to employ spectrally multiplexed states and memories with fixed storage times that allow such mapping between spectral modes. Furthermore, using a Ti:Tm:LiNbO3 waveguide cooled to 3 K, a phase modulator, and a spectral filter, we demonstrate storage followed by the required feed-forward-controlled frequency manipulation with time-bin qubits encoded into up to 26 multiplexed spectral modes and 97% fidelity.

  8. RF assisted switching in magnetic Josephson junctions

    NASA Astrophysics Data System (ADS)

    Caruso, R.; Massarotti, D.; Bolginov, V. V.; Ben Hamida, A.; Karelina, L. N.; Miano, A.; Vernik, I. V.; Tafuri, F.; Ryazanov, V. V.; Mukhanov, O. A.; Pepe, G. P.

    2018-04-01

    We test the effect of an external RF field on the switching processes of magnetic Josephson junctions (MJJs) suitable for the realization of fast, scalable cryogenic memories compatible with Single Flux Quantum logic. We show that the combined application of microwaves and magnetic field pulses can improve the performance of the device, increasing the separation between the critical current levels corresponding to logical "0" and "1." The enhancement of the current level separation can be as high as 80% using an optimal set of parameters. We demonstrate that external RF fields can be used as an additional tool to manipulate the memory states, and we expect that this approach may lead to the development of new methods of selecting MJJs and manipulating their states in memory arrays for various applications.

  9. Colouring in the Blanks: Memory Drawings of the 1990 Kuwait Invasion

    ERIC Educational Resources Information Center

    Pepin-Wakefield, Yvonne

    2009-01-01

    This study used drawing tasks to examine the similarities and differences between females and males who shared a collective traumatic event in early childhood. Could these childhood memories be recorded, measured, and compared for gender differences in drawings by young adults who had shared a similar experience as children? Exploration of this…

  10. Functions of Memory Sharing and Mother-Child Reminiscing Behaviors: Individual and Cultural Variations

    ERIC Educational Resources Information Center

    Kulkofsky, Sarah; Wang, Qi; Koh, Jessie Bee Kim

    2009-01-01

    This study examined maternal beliefs about the functions of memory sharing and the relations between these beliefs and mother-child reminiscing behaviors in a cross-cultural context. Sixty-three European American and 47 Chinese mothers completed an open-ended questionnaire concerning their beliefs about the functions of parent-child memory…

  11. Stillbirth and stigma: the spoiling and repair of multiple social identities.

    PubMed

    Brierley-Jones, Lyn; Crawley, Rosalind; Lomax, Samantha; Ayers, Susan

    This study investigated mothers' experiences surrounding stillbirth in the United Kingdom, their memory making and sharing opportunities, and the effect these opportunities had on them. Qualitative data were generated from free-text responses to open-ended questions. Thematic content analysis revealed that "stigma" was experienced by most women, and Goffman's (1963) work on stigma was subsequently used as an analytical framework. Results suggest that stillbirth can spoil the identities of "patient," "mother," and "full citizen." Stigma was reported as arising from interactions with professionals, family, friends, work colleagues, and even casual acquaintances. Stillbirth produces common learning experiences often requiring "identity work" (Murphy, 2012). Memory making and sharing may be important in this work and further research is needed. Stigma can reduce the memory sharing opportunities for women after stillbirth, and this may explain some of the differential mental health effects of memory making after stillbirth that are documented in the literature.

  12. Parallelization of KENO-Va Monte Carlo code

    NASA Astrophysics Data System (ADS)

    Ramón, Javier; Peña, Jorge

    1995-07-01

    KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code have been generated: one for shared-memory machines and the other for distributed-memory systems using the message-passing library PVM. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for the random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared-memory version. An FDDI network of 6 HP9000/735 workstations was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
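
    The "advanced seeds" technique can be illustrated with a linear congruential generator (LCG): each history is given the random-number state the serial run would have reached, by jumping the generator ahead in O(log n) steps, so results are reproducible regardless of how histories are distributed across processors. The sketch below is a generic illustration of that idea, not KENO-Va's actual generator or constants.

```c
#include <stdint.h>

/* Illustrative 64-bit LCG constants (Knuth's MMIX values), not the
 * generator actually used in KENO-Va. */
#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

/* Jump an LCG state ahead n steps in O(log n).  n applications of
 * x -> a*x + c compose to x -> A*x + C; we accumulate (A, C) by
 * repeatedly squaring the (a, c) transform.  Arithmetic is mod 2^64,
 * which unsigned overflow provides for free. */
static uint64_t lcg_skip(uint64_t state, uint64_t n) {
    uint64_t A = 1, C = 0;              /* identity transform */
    uint64_t a = LCG_A, c = LCG_C;
    while (n > 0) {
        if (n & 1) { A *= a; C = C * a + c; }
        c *= (a + 1);                   /* square (a, c): c' = a*c + c */
        a *= a;
        n >>= 1;
    }
    return A * state + C;
}

/* Each parallel history h then starts from the seed the serial run
 * would have produced, e.g. lcg_skip(generation_seed, h * STRIDE)
 * for some fixed per-history stride. */
```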

  13. Long-range interactions and parallel scalability in molecular simulations

    NASA Astrophysics Data System (ADS)

    Patra, Michael; Hyvönen, Marja T.; Falck, Emma; Sabouri-Ghomi, Mohsen; Vattulainen, Ilpo; Karttunen, Mikko

    2007-01-01

    Typical biomolecular systems such as cellular membranes, DNA, and protein complexes are highly charged. Thus, efficient and accurate treatment of electrostatic interactions is of great importance in computational modeling of such systems. We have employed the GROMACS simulation package to perform extensive benchmarking of different commonly used electrostatic schemes on a range of computer architectures (Pentium-4, IBM Power 4, and Apple/IBM G5) for single-processor and parallel performance up to 8 nodes. We have also tested scalability on four different networks: Infiniband, GigaBit Ethernet, Fast Ethernet, and a nearly uniform memory architecture in which communication between CPUs is possible by directly reading from or writing to other CPUs' local memory. It turns out that the particle-mesh Ewald method (PME) performs surprisingly well and offers competitive performance unless parallel runs on PC hardware with older network infrastructure are needed. Lipid bilayers of 128, 512 and 2048 lipid molecules were used as test systems representing typical cases encountered in biomolecular simulations. Our results enable an accurate prediction of computational speed on most current computing systems, both for serial and parallel runs. These results should be helpful in, for example, choosing the most suitable configuration for a small departmental computer cluster.

  14. Scalable PGAS Metadata Management on Extreme Scale Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chavarría-Miranda, Daniel; Agarwal, Khushbu; Straatsma, TP

    Programming models intended to run on exascale systems face a number of challenges, especially the sheer size of the system as measured by the number of concurrent software entities created and managed by the underlying runtime. It is clear from the size of these systems that any state maintained by the programming model has to be strictly sub-linear in size, in order not to overwhelm memory usage with pure overhead. A principal feature of Partitioned Global Address Space (PGAS) models is providing easy access to global-view distributed data structures. In order to provide efficient access to these distributed data structures, PGAS models must keep track of metadata such as where array sections are located with respect to processes/threads running on the HPC system. As PGAS models and applications become ubiquitous on very large trans-petascale systems, a key component of their performance and scalability will be efficient and judicious use of memory for model overhead (metadata) compared to application data. We present an evaluation of several strategies to manage PGAS metadata that exhibit different space/time tradeoffs. We use two real-world PGAS applications to capture metadata usage patterns and gain insight into their communication behavior.
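
    The space/time tradeoff described above can be made concrete: a stored per-array directory gives fast lookups but grows linearly with process count, while ownership under a regular distribution can be computed in O(1) with no stored state. The C sketch below is illustrative only and is not the implementation evaluated in the paper.

```c
#include <stdint.h>

/* Strategy 1: stored metadata, O(p) space per array.  Ownership of a
 * global index is found by binary search over per-rank start indices. */
typedef struct {
    int64_t *first_index;  /* first global index owned by each rank */
    int      nranks;
} directory_t;

static int owner_lookup(const directory_t *d, int64_t i) {
    int lo = 0, hi = d->nranks - 1;    /* O(log p) search */
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (d->first_index[mid] <= i) lo = mid;
        else                          hi = mid - 1;
    }
    return lo;
}

/* Strategy 2: computed metadata, O(1) space.  Works only for regular
 * (here, block) distributions of n elements over p ranks, which is
 * exactly why irregular distributions force stored metadata. */
static int owner_computed(int64_t i, int64_t n, int p) {
    int64_t block = (n + p - 1) / p;   /* ceiling division */
    return (int)(i / block);
}
```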

  15. Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures

    NASA Astrophysics Data System (ADS)

    Mills, R. T.; Hoffman, F. M.; Kumar, J.; Sreepathi, S.; Sripathi, V.

    2016-12-01

    The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of available parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe a massively parallel implementation of accelerated k-means clustering and some optimizations to boost computational intensity and utilization of wide SIMD lanes on state-of-the-art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture, which offers several new features, including an on-package high-bandwidth memory. We also analyze the code in the context of a few practical applications to the analysis of climatic and remotely-sensed vegetation phenology data sets, and speculate on some of the new applications that such scalable analysis methods may enable.
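
    The kernel such implementations spend most of their time optimizing is the point-to-centroid assignment step. The sketch below shows that step with OpenMP shared-memory parallelism standing in for the paper's SIMD and Knights Landing tuning; it is a generic baseline, not the authors' accelerated algorithm.

```c
#include <float.h>
#include <stddef.h>

/* Assign each of n dim-dimensional points to its nearest of k centroids.
 * Points are row-major: x[i*dim + d].  The outer loop is embarrassingly
 * parallel, which is what makes k-means amenable to manycore hardware. */
void assign_points(const double *x, size_t n, size_t dim,
                   const double *centroids, size_t k, int *label)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        double best = DBL_MAX;
        int arg = 0;
        for (size_t c = 0; c < k; c++) {
            double dist = 0.0;         /* squared Euclidean distance */
            for (size_t d = 0; d < dim; d++) {
                double diff = x[i*dim + d] - centroids[c*dim + d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; arg = (int)c; }
        }
        label[i] = arg;
    }
}
```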

  16. Cooperative Learning for Distributed In-Network Traffic Classification

    NASA Astrophysics Data System (ADS)

    Joseph, S. B.; Loo, H. R.; Ismail, I.; Andromeda, T.; Marsono, M. N.

    2017-04-01

    Inspired by the concept of autonomic distributed/decentralized network management schemes, we consider the issue of information exchange among distributed network nodes to improve network performance and promote scalability for in-network monitoring. In this paper, we propose a cooperative learning algorithm for propagation and synchronization of network information among autonomic distributed network nodes for online traffic classification. The results show that network nodes with sharing capability perform better, with a higher average accuracy of 89.21% (sharing data) and 88.37% (sharing clusters), compared to 88.06% for nodes without cooperative learning capability. The overall performance indicates that cooperative learning is promising for distributed in-network traffic classification.

  17. Brain Information Sharing During Visual Short-Term Memory Binding Yields a Memory Biomarker for Familial Alzheimer's Disease.

    PubMed

    Parra, Mario A; Mikulan, Ezequiel; Trujillo, Natalia; Sala, Sergio Della; Lopera, Francisco; Manes, Facundo; Starr, John; Ibanez, Agustin

    2017-01-01

    Alzheimer's disease (AD) has been conceptualized as a disconnection syndrome which disrupts both brain information sharing and memory binding functions. The extent to which these two phenotypic expressions share pathophysiological mechanisms remains unknown. We aimed to unveil the electrophysiological correlates of integrative memory impairments in AD, towards new memory biomarkers for its prodromal stages. Patients with 100% risk of familial AD (FAD) and healthy controls underwent assessment with the Visual Short-Term Memory binding test (VSTMBT) while we recorded their EEG. We applied a novel brain connectivity method (Weighted Symbolic Mutual Information) to the EEG data. Patients showed significant deficits during the VSTMBT. A reduction of brain connectivity was observed during rest as well as during correct VSTM binding, particularly over frontal and posterior regions. An increase of connectivity was found during VSTM binding performance over central regions. While decreased connectivity was found in cases at more advanced stages of FAD, increased brain connectivity appeared in cases at earlier stages. Such altered patterns of task-related connectivity were found in 89% of the assessed patients. VSTM binding deficits in the prodromal stages of FAD are thus associated with altered patterns of brain connectivity, confirming the link between integrative memory deficits and impaired brain information sharing in prodromal FAD. While significant loss of brain connectivity seems to be a feature of the advanced stages of FAD, increased brain connectivity characterizes its earlier stages. These findings are discussed in the light of recent proposals about the earliest pathophysiological mechanisms of AD and their clinical expression.

  18. An FPGA-Based Massively Parallel Neuromorphic Cortex Simulator

    PubMed Central

    Wang, Runchun M.; Thakur, Chetan S.; van Schaik, André

    2018-01-01

    This paper presents a massively parallel and scalable neuromorphic cortex simulator designed for simulating large and structurally connected spiking neural networks, such as complex models of various areas of the cortex. The main novelty of this work is the abstraction of a neuromorphic architecture into clusters represented by minicolumns and hypercolumns, analogously to the fundamental structural units observed in neurobiology. Without this approach, simulating large-scale fully connected networks needs prohibitively large memory to store look-up tables for point-to-point connections. Instead, we use a novel architecture, based on the structural connectivity in the neocortex, such that all the required parameters and connections can be stored in on-chip memory. The cortex simulator can be easily reconfigured for simulating different neural networks without any change in hardware structure by programming the memory. A hierarchical communication scheme allows one neuron to have a fan-out of up to 200 k neurons. As a proof-of-concept, an implementation on one Altera Stratix V FPGA was able to simulate 20 million to 2.6 billion leaky-integrate-and-fire (LIF) neurons in real time. We verified the system by emulating a simplified auditory cortex (with 100 million neurons). This cortex simulator achieved a low power dissipation of 1.62 μW per neuron. With the advent of commercially available FPGA boards, our system offers an accessible and scalable tool for the design, real-time simulation, and analysis of large-scale spiking neural networks. PMID:29692702

  19. An FPGA-Based Massively Parallel Neuromorphic Cortex Simulator.

    PubMed

    Wang, Runchun M; Thakur, Chetan S; van Schaik, André

    2018-01-01

    This paper presents a massively parallel and scalable neuromorphic cortex simulator designed for simulating large and structurally connected spiking neural networks, such as complex models of various areas of the cortex. The main novelty of this work is the abstraction of a neuromorphic architecture into clusters represented by minicolumns and hypercolumns, analogously to the fundamental structural units observed in neurobiology. Without this approach, simulating large-scale fully connected networks needs prohibitively large memory to store look-up tables for point-to-point connections. Instead, we use a novel architecture, based on the structural connectivity in the neocortex, such that all the required parameters and connections can be stored in on-chip memory. The cortex simulator can be easily reconfigured for simulating different neural networks without any change in hardware structure by programming the memory. A hierarchical communication scheme allows one neuron to have a fan-out of up to 200 k neurons. As a proof-of-concept, an implementation on one Altera Stratix V FPGA was able to simulate 20 million to 2.6 billion leaky-integrate-and-fire (LIF) neurons in real time. We verified the system by emulating a simplified auditory cortex (with 100 million neurons). This cortex simulator achieved a low power dissipation of 1.62 μW per neuron. With the advent of commercially available FPGA boards, our system offers an accessible and scalable tool for the design, real-time simulation, and analysis of large-scale spiking neural networks.

  20. TopoMS: Comprehensive topological exploration for molecular and condensed-matter systems.

    PubMed

    Bhatia, Harsh; Gyulassy, Attila G; Lordi, Vincenzo; Pask, John E; Pascucci, Valerio; Bremer, Peer-Timo

    2018-06-15

    We introduce TopoMS, a computational tool enabling detailed topological analysis of molecular and condensed-matter systems, including the computation of atomic volumes and charges through the quantum theory of atoms in molecules, as well as the complete molecular graph. With roots in techniques from computational topology, and using a shared-memory parallel approach, TopoMS provides scalable, numerically robust, and topologically consistent analysis. TopoMS can be used as a command-line tool or with a GUI (graphical user interface), where the latter also enables an interactive exploration of the molecular graph. This paper presents algorithmic details of TopoMS and compares it with state-of-the-art tools: Bader charge analysis v1.0 (Arnaldsson et al., 01/11/17) and molecular graph extraction using Critic2 (Otero-de-la-Roza et al., Comput. Phys. Commun. 2014, 185, 1007). TopoMS not only combines the functionality of these individual codes but also demonstrates up to 4× performance gain on a standard laptop, faster convergence to fine-grid solution, robustness against lattice bias, and topological consistency. TopoMS is released publicly under the BSD License.

  1. Decomposing the relationship between cognitive functioning and self-referent memory beliefs in older adulthood: What’s memory got to do with it?

    PubMed Central

    Payne, Brennan R.; Gross, Alden L.; Hill, Patrick L.; Parisi, Jeanine M.; Rebok, George W.; Stine-Morrow, Elizabeth A. L.

    2018-01-01

    With advancing age, episodic memory performance shows marked declines along with concurrent reports of lower subjective memory beliefs. Given that normative age-related declines in episodic memory co-occur with declines in other cognitive domains, we examined the relationship between memory beliefs and multiple domains of cognitive functioning. Confirmatory bi-factor structural equation models were used to parse the shared and independent variance among factors representing episodic memory, psychomotor speed, and executive reasoning in one large cohort study (Senior Odyssey, N = 462), and replicated using another large cohort of healthy older adults (ACTIVE, N = 2,802). Accounting for a general fluid cognitive functioning factor (comprised of the shared variance among measures of episodic memory, speed, and reasoning) attenuated the relationship between objective memory performance and subjective memory beliefs in both samples. Moreover, the general cognitive functioning factor was the strongest predictor of memory beliefs in both samples. These findings are consistent with the notion that dispositional memory beliefs may reflect perceptions of cognition more broadly. This may be one reason why memory beliefs have broad predictive validity for interventions that target fluid cognitive ability. PMID:27685541

  2. Decomposing the relationship between cognitive functioning and self-referent memory beliefs in older adulthood: what's memory got to do with it?

    PubMed

    Payne, Brennan R; Gross, Alden L; Hill, Patrick L; Parisi, Jeanine M; Rebok, George W; Stine-Morrow, Elizabeth A L

    2017-07-01

    With advancing age, episodic memory performance shows marked declines along with concurrent reports of lower subjective memory beliefs. Given that normative age-related declines in episodic memory co-occur with declines in other cognitive domains, we examined the relationship between memory beliefs and multiple domains of cognitive functioning. Confirmatory bi-factor structural equation models were used to parse the shared and independent variance among factors representing episodic memory, psychomotor speed, and executive reasoning in one large cohort study (Senior Odyssey, N = 462), and replicated using another large cohort of healthy older adults (ACTIVE, N = 2802). Accounting for a general fluid cognitive functioning factor (comprised of the shared variance among measures of episodic memory, speed, and reasoning) attenuated the relationship between objective memory performance and subjective memory beliefs in both samples. Moreover, the general cognitive functioning factor was the strongest predictor of memory beliefs in both samples. These findings are consistent with the notion that dispositional memory beliefs may reflect perceptions of cognition more broadly. This may be one reason why memory beliefs have broad predictive validity for interventions that target fluid cognitive ability.

  3. Exploiting NASA's Cumulus Earth Science Cloud Archive with Services and Computation

    NASA Astrophysics Data System (ADS)

    Pilone, D.; Quinn, P.; Jazayeri, A.; Schuler, I.; Plofchan, P.; Baynes, K.; Ramachandran, R.

    2017-12-01

    NASA's Earth Observing System Data and Information System (EOSDIS) houses nearly 30 PB of critical Earth Science data and, with upcoming missions, is expected to balloon to between 200 PB and 300 PB over the next seven years. In addition to the massive increase in data collected, researchers and application developers want more and faster access - enabling complex visualizations, long time-series analysis, and cross-dataset research without needing to copy and manage massive amounts of data locally. NASA has started prototyping with commercial cloud providers to make this data available in elastic cloud compute environments, allowing application developers direct access to the massive EOSDIS holdings. In this talk we'll explain the principles behind the archive architecture and share our experience of dealing with large amounts of data with serverless architectures including AWS Lambda, the Elastic Container Service (ECS) for long-running jobs, and why we dropped thousands of lines of code for AWS Step Functions. We'll discuss best practices and patterns for accessing and using data available in a shared object store (S3) and leveraging events and message passing for sophisticated and highly scalable processing and analysis workflows. Finally we'll share capabilities NASA and cloud services are making available on the archives to enable massively scalable analysis and computation in a variety of formats and tools.

  4. Method for prefetching non-contiguous data structures

    DOEpatents

    Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Brewster, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

    2009-05-05

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching mechanism for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers are used to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
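
    Here is a minimal C sketch of the pointer-augmented memory line described above; the line size, field names, and the software stand-in for the hardware prefetch are all illustrative assumptions.

```c
#include <stdint.h>

#define LINE_BYTES 128   /* illustrative line size */

/* Each line carries, in addition to its normal data, a pointer wide
 * enough to name any other line; the pointer, not a stride predictor,
 * decides what to prefetch next. */
typedef struct memory_line {
    uint8_t             data[LINE_BYTES];
    struct memory_line *prefetch_next;
} memory_line_t;

/* In hardware the fetch of line->prefetch_next would be issued
 * automatically on access; here a compiler builtin stands in for it. */
static inline void touch(const memory_line_t *line) {
    if (line->prefetch_next)
        __builtin_prefetch(line->prefetch_next, /*read*/0, /*locality*/1);
}
```

This is what lets repetitive but non-contiguous traversals (for example, a fixed linked-list walk) be prefetched accurately, since each line records exactly where the traversal goes next.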

  5. Unconditional polarization qubit quantum memory at room temperature

    NASA Astrophysics Data System (ADS)

    Namazi, Mehdi; Kupchak, Connor; Jordaan, Bertus; Shahrokhshahi, Reihaneh; Figueroa, Eden

    2016-05-01

    The creation of global quantum key distribution and quantum communication networks requires multiple operational quantum memories. Achieving a considerable reduction in experimental and cost overhead in these implementations is thus a major challenge. Here we present a polarization qubit quantum memory fully operational at 330 K, an unheard-of frontier in the development of useful qubit quantum technology. This result is achieved through an extensive study of how the optical response of a cold atomic medium is transformed by the motion of atoms at room temperature, leading to an optimal characterization of room-temperature quantum light-matter interfaces. Our quantum memory shows an average fidelity of 86.6 ± 0.6% for optical pulses containing on average 1 photon per pulse, thereby defeating any classical strategy exploiting the non-unitary character of the memory efficiency. Our system significantly decreases the technological overhead required to achieve quantum memory operation and will serve as a building block for scalable and technologically simpler many-memory quantum machines. The work was supported by the US-Navy Office of Naval Research, Grant Number N00141410801 and the Simons Foundation, Grant Number SBF241180. B. J. acknowledges financial assistance of the National Research Foundation (NRF) of South Africa.

  6. Cooperative Data Sharing: Simple Support for Clusters of SMP Nodes

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Bailey, David H. (Technical Monitor)

    1997-01-01

    Libraries like PVM and MPI send typed messages to allow for heterogeneous cluster computing. Lower-level libraries, such as GAM, provide more efficient access to communication by removing the need to copy messages between the interface and user space in some cases. Still lower-level interfaces, such as UNET, get right down to the hardware level to provide maximum performance. However, these are all still interfaces for passing messages from one process to another, and have limited utility in a shared-memory environment, due primarily to the fact that message passing is just another term for copying. This drawback is made more pertinent by today's hybrid architectures (e.g. clusters of SMPs), where it is difficult to know beforehand whether two communicating processes will share memory. As a result, even portable language tools (like HPF compilers) must either map all interprocess communication into message passing, with the accompanying performance degradation in shared-memory environments, or they must check each communication at run-time and implement the shared-memory case separately for efficiency. Cooperative Data Sharing (CDS) is a single user-level API which abstracts all communication between processes into the sharing and access coordination of memory regions, in a model which might be described as "distributed shared messages" or "large-grain distributed shared memory". As a result, the user programs to a simple latency-tolerant abstract communication specification which can be mapped efficiently to either a shared-memory or message-passing based run-time system, depending upon the available architecture. Unlike some distributed shared memory interfaces, the user still has complete control over the assignment of data to processors, the forwarding of data to its next likely destination, and the queuing of data until it is needed, so even the relatively high latency present in clusters can be accommodated. CDS does not require special use of an MMU, which can add overhead to some DSM systems, and does not require an SPMD programming model. Unlike some message-passing interfaces, CDS allows the user to implement efficient demand-driven applications where processes must "fight" over data, and does not perform copying if processes share memory and do not attempt concurrent writes. CDS also supports heterogeneous computing, dynamic process creation, handlers, and a very simple thread-arbitration mechanism. Additional support for array subsections is currently being considered. The CDS1 API, which forms the kernel of CDS, is built primarily upon only 2 communication primitives, one process initiation primitive, and some data translation (and marshalling) routines, memory allocation routines, and priority control routines. The entire current collection of 28 routines provides enough functionality to implement most (or all) of MPI 1 and 2, which has a much larger interface consisting of hundreds of routines. Still, the API is small enough to consider integrating into standard OS interfaces for handling inter-process communication in a network-independent way. This approach would also help to solve many of the problems plaguing other higher-level standards such as MPI and PVM which must, in some cases, "play OS" to adequately address progress and process control issues. The CDS2 API, a higher level of interface roughly equivalent in functionality to MPI and to be built entirely upon CDS1, is still being designed. It is intended to add support for the equivalent of communicators, reduction and other collective operations, process topologies, additional support for process creation, and some automatic memory management. CDS2 will not exactly match MPI, because the copy-free semantics of communication from CDS1 will be supported. CDS2 application programs will also be free to use CDS1 directly, with care. CDS1 has been implemented on networks of workstations running unmodified Unix-based operating systems, using UDP/IP and vendor-supplied high-performance locks. Although its inter-node performance is currently unimpressive due to rudimentary implementation techniques, it even now outperforms highly-optimized MPI implementations on intra-node communication due to its support for non-copy communication. The similarity of the CDS1 architecture to that of other projects such as UNET and TRAP suggests that the inter-node performance can be increased significantly to surpass MPI or PVM, and it may be possible to migrate some of its functionality to communication controllers.
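
    The abstract does not give the actual CDS1 signatures, so the following C sketch only conveys the flavor of an interface built around two communication primitives plus region allocation and process initiation; every name and signature here is hypothetical.

```c
#include <stddef.h>

/* Hypothetical CDS1-flavored interface -- invented for illustration. */
typedef int cds_region_t;   /* handle to a shareable memory region */

/* Allocate a region that can later be shared without copying. */
void *cds_alloc(size_t bytes, cds_region_t *region);

/* The two communication primitives: publish a region to a peer, and
 * acquire a region a peer has published.  On a shared-memory node these
 * can be pointer handoffs with access coordination; across a network
 * they degenerate to a message send/receive -- the same program maps
 * to either runtime. */
int cds_share(cds_region_t region, int peer, int tag);
int cds_acquire(cds_region_t *region, int peer, int tag);

/* The single process-initiation primitive. */
int cds_spawn(const char *executable, int where);
```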

  7. A shared resource between declarative memory and motor memory.

    PubMed

    Keisler, Aysha; Shadmehr, Reza

    2010-11-03

    The neural systems that support motor adaptation in humans are thought to be distinct from those that support the declarative system. Yet, during motor adaptation changes in motor commands are supported by a fast adaptive process that has important properties (rapid learning, fast decay) that are usually associated with the declarative system. The fast process can be contrasted to a slow adaptive process that also supports motor memory, but learns gradually and shows resistance to forgetting. Here we show that after people stop performing a motor task, the fast motor memory can be disrupted by a task that engages declarative memory, but the slow motor memory is immune from this interference. Furthermore, we find that the fast/declarative component plays a major role in the consolidation of the slow motor memory. Because of the competitive nature of declarative and nondeclarative memory during consolidation, impairment of the fast/declarative component leads to improvements in the slow/nondeclarative component. Therefore, the fast process that supports formation of motor memory is not only neurally distinct from the slow process, but it shares critical resources with the declarative memory system.

  8. A shared resource between declarative memory and motor memory

    PubMed Central

    Keisler, Aysha; Shadmehr, Reza

    2010-01-01

    The neural systems that support motor adaptation in humans are thought to be distinct from those that support the declarative system. Yet, during motor adaptation changes in motor commands are supported by a fast adaptive process that has important properties (rapid learning, fast decay) that are usually associated with the declarative system. The fast process can be contrasted to a slow adaptive process that also supports motor memory, but learns gradually and shows resistance to forgetting. Here we show that after people stop performing a motor task, the fast motor memory can be disrupted by a task that engages declarative memory, but the slow motor memory is immune from this interference. Furthermore, we find that the fast/declarative component plays a major role in the consolidation of the slow motor memory. Because of the competitive nature of declarative and non-declarative memory during consolidation, impairment of the fast/declarative component leads to improvements in the slow/non-declarative component. Therefore, the fast process that supports formation of motor memory is not only neurally distinct from the slow process, but it shares critical resources with the declarative memory system. PMID:21048140

  9. Discrete-Slots Models of Visual Working-Memory Response Times

    PubMed Central

    Donkin, Christopher; Nosofsky, Robert M.; Gold, Jason M.; Shiffrin, Richard M.

    2014-01-01

    Much recent research has aimed to establish whether visual working memory (WM) is better characterized by a limited number of discrete all-or-none slots or by a continuous sharing of memory resources. To date, however, researchers have not considered the response-time (RT) predictions of discrete-slots versus shared-resources models. To complement the past research in this field, we formalize a family of mixed-state, discrete-slots models for explaining choice and RTs in tasks of visual WM change detection. In the tasks under investigation, a small set of visual items is presented, followed by a test item in 1 of the studied positions for which a change judgment must be made. According to the models, if the studied item in that position is retained in 1 of the discrete slots, then a memory-based evidence-accumulation process determines the choice and the RT; if the studied item in that position is missing, then a guessing-based accumulation process operates. Observed RT distributions are therefore theorized to arise as probabilistic mixtures of the memory-based and guessing distributions. We formalize an analogous set of continuous shared-resources models. The model classes are tested on individual subjects with both qualitative contrasts and quantitative fits to RT-distribution data. The discrete-slots models provide much better qualitative and quantitative accounts of the RT and choice data than do the shared-resources models, although there is some evidence for “slots plus resources” when memory set size is very small. PMID:24015956
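
    The mixed-state idea can be conveyed with a toy simulation: on each trial, the probe either occupies a slot (memory-driven drift) or does not (zero-drift guessing), and the observed RT distribution is the probabilistic mixture of the two. The C sketch below uses a simple biased random walk with illustrative parameters; it is not the authors' fitted model.

```c
#include <stdlib.h>

/* Uniform deviate in (0, 1). */
static double unif(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }

/* One accumulator trial: step +1 with probability p_up, else -1;
 * absorb at +bound ("change") or -bound ("no change").  Returns RT in
 * seconds; *choice receives the response. */
static double walk_rt(double p_up, int bound, int *choice) {
    int x = 0, steps = 0;
    while (x < bound && x > -bound) {
        x += (unif() < p_up) ? 1 : -1;
        steps++;
    }
    *choice = (x >= bound);
    return 0.2 + steps * 0.002;   /* non-decision time + step duration */
}

/* With probability p_slot the item is held in a slot and evidence has
 * a memory-driven drift; otherwise the zero-drift guessing process
 * runs.  RTs across trials are thus a mixture of two distributions. */
double simulate_trial(double p_slot, int *choice) {
    double p_up = (unif() < p_slot) ? 0.6 : 0.5;
    return walk_rt(p_up, 25, choice);
}
```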

  10. Shared Representations in Language Processing and Verbal Short-Term Memory: The Case of Grammatical Gender

    ERIC Educational Resources Information Center

    Schweppe, Judith; Rummer, Ralf

    2007-01-01

    The general idea of language-based accounts of short-term memory is that retention of linguistic materials is based on representations within the language processing system. In the present sentence recall study, we address the question whether the assumption of shared representations holds for morphosyntactic information (here: grammatical gender…

  11. The Precategorical Nature of Visual Short-Term Memory

    ERIC Educational Resources Information Center

    Quinlan, Philip T.; Cohen, Dale J.

    2016-01-01

    We conducted a series of recognition experiments that assessed whether visual short-term memory (VSTM) is sensitive to shared category membership of to-be-remembered (tbr) images of common objects. In Experiment 1 some of the tbr items shared the same basic level category (e.g., hand axe): Such items were no better retained than others. In the…

  12. Fault tolerant onboard packet switch architecture for communication satellites: Shared memory per beam approach

    NASA Technical Reports Server (NTRS)

    Shalkhauser, Mary Jo; Quintana, Jorge A.; Soni, Nitin J.

    1994-01-01

    The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.

  13. The performance of disk arrays in shared-memory database machines

    NASA Technical Reports Server (NTRS)

    Katz, Randy H.; Hong, Wei

    1993-01-01

    In this paper, we examine how disk arrays and shared-memory multiprocessors lead to an effective method for constructing database machines for general-purpose complex query processing. We show that disk arrays can lead to cost-effective storage systems if they are configured from suitably small form-factor disk drives. We introduce the storage system metric data temperature as a way to evaluate how well a disk configuration can sustain its workload, and we show that disk arrays can sustain the same data temperature as a more expensive mirrored-disk configuration. We use the metric to evaluate the performance of disk arrays in XPRS, an operational shared-memory multiprocessor database system being developed at the University of California, Berkeley.
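
    Data temperature is usually defined as sustained I/O rate per unit of storage capacity; under that assumed definition (the paper's exact formulation may differ), the arithmetic is simply:

```c
/* Data temperature as accesses per second per gigabyte -- an assumed
 * definition for illustration.  Example: ten 2 GB drives each sustaining
 * 50 I/Os per second give data_temperature(500.0, 20.0) == 25.0. */
static double data_temperature(double io_per_second, double capacity_gb) {
    return io_per_second / capacity_gb;
}
```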

  14. Enhancing Scalability and Efficiency of the TOUGH2_MP for Linux Clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Keni; Wu, Yu-Shu

    2006-04-17

    TOUGH2_MP, the parallel version of the TOUGH2 code, has been enhanced by implementing more efficient communication schemes. This enhancement is achieved by reducing the number of small messages and the volume of large messages. The message exchange speed is further improved by using non-blocking communication for both linear and nonlinear iterations. In addition, we have modified the AZTEC parallel linear-equation solver to use non-blocking communication. Through improved code structuring and bug fixes, the new version of the code is more stable, while demonstrating similar or even better nonlinear iteration convergence speed than the original TOUGH2 code. As a result, the new version of TOUGH2_MP is significantly more efficient. In this paper, the scalability and efficiency of the parallel code are demonstrated by solving two large-scale problems. The testing results indicate that the speedup of the code may depend on both problem size and complexity. In general, the code has excellent scalability in memory requirement as well as computing time.
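
    The non-blocking pattern the record credits for the speedup is the standard MPI idiom of posting all receives and sends, overlapping independent local work, and waiting at the end. The sketch below shows that generic idiom in C; it is not TOUGH2_MP's actual code.

```c
#include <mpi.h>

/* Exchange 'count' doubles with each of 'nneigh' neighbors without
 * blocking, so local computation can overlap communication. */
void exchange_ghost(double *sendbuf, double *recvbuf, int count,
                    const int *neighbors, int nneigh, MPI_Comm comm)
{
    MPI_Request reqs[2 * nneigh];      /* C99 variable-length array */
    for (int i = 0; i < nneigh; i++) {
        MPI_Irecv(&recvbuf[i * count], count, MPI_DOUBLE, neighbors[i],
                  0, comm, &reqs[i]);
        MPI_Isend(&sendbuf[i * count], count, MPI_DOUBLE, neighbors[i],
                  0, comm, &reqs[nneigh + i]);
    }
    /* ... independent local work can proceed here ... */
    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
}
```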

  15. Scalable metagenomic taxonomy classification using a reference genome database

    PubMed Central

    Ames, Sasha K.; Hysom, David A.; Gardner, Shea N.; Lloyd, G. Scott; Gokhale, Maya B.; Allen, Jonathan E.

    2013-01-01

    Motivation: Deep metagenomic sequencing of biological samples has the potential to recover otherwise difficult-to-detect microorganisms and accurately characterize biological samples with limited prior knowledge of sample contents. Existing metagenomic taxonomic classification algorithms, however, do not scale well to analyze large metagenomic datasets, and balancing classification accuracy with computational efficiency presents a fundamental challenge. Results: A method is presented to shift computational costs to an off-line computation by creating a taxonomy/genome index that supports scalable metagenomic classification. Scalable performance is demonstrated on real and simulated data to show accurate classification in the presence of novel organisms on samples that include viruses, prokaryotes, fungi and protists. Taxonomic classification of the previously published 150 giga-base Tyrolean Iceman dataset was found to take <20 h on a single-node, 40-core, large-memory machine and provided new insights on the metagenomic contents of the sample. Availability: Software was implemented in C++ and is freely available at http://sourceforge.net/projects/lmat Contact: allen99@llnl.gov Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23828782

  16. Optical memories in digital computing

    NASA Technical Reports Server (NTRS)

    Alford, C. O.; Gaylord, T. K.

    1979-01-01

    High-capacity optical memories with relatively high data-transfer rates and multiport simultaneous-access capability may serve as the basis for new computer architectures. Several computer structures that might profitably use such memories are: a) a simultaneous record-access system, b) a simultaneously-shared memory computer system, and c) a parallel digital processing structure.

  17. Experimental evaluation of a flexible I/O architecture for accelerating workflow engines in ultrascale environments

    DOE PAGES

    Duro, Francisco Rodrigo; Blas, Javier Garcia; Isaila, Florin; ...

    2016-10-06

    The increasing volume of scientific data and the limited scalability and performance of storage systems are currently presenting a significant limitation for the productivity of the scientific workflows running on both high-performance computing (HPC) and cloud platforms. Clearly needed is better integration of storage systems and workflow engines to address this problem. This paper presents and evaluates a novel solution that leverages codesign principles for integrating Hercules—an in-memory data store—with a workflow management system. We consider four main aspects: workflow representation, task scheduling, task placement, and task termination. As a result, the experimental evaluation on both cloud and HPC systems demonstrates significant performance and scalability improvements over existing state-of-the-art approaches.

  18. A Scalable Nonuniform Pointer Analysis for Embedded Programs

    NASA Technical Reports Server (NTRS)

    Venet, Arnaud

    2004-01-01

    In this paper we present a scalable pointer analysis for embedded applications that is able to distinguish between instances of recursively defined data structures and elements of arrays. The main contribution consists of an efficient yet precise algorithm that can handle multithreaded programs. We first perform an inexpensive flow-sensitive analysis of each function in the program that generates semantic equations describing the effect of the function on the memory graph. These equations bear numerical constraints that describe nonuniform points-to relationships. We then iteratively solve these equations in order to obtain an abstract storage graph that describes the shape of data structures at every point of the program for all possible thread interleavings. We present experimental evidence that this approach is tractable and precise for real-size embedded applications.

  19. Reader set encoding for directory of shared cache memory in multiprocessor system

    DOEpatents

    Ahn, Daniel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Zhuang, Xiaotong

    2014-06-10

    In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
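
    Here is a minimal C sketch of a reader-set bitset in a directory entry; the width and names are illustrative, not the actual patented encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Directory entry for one cache line: bit t set means speculative
 * thread t has read the line. */
typedef struct {
    uint64_t reader_set;   /* supports up to 64 speculative threads */
} dir_entry_t;

static void record_read(dir_entry_t *e, int thread) {
    e->reader_set |= 1ULL << thread;
}

/* A speculative write by 'writer' conflicts if any other thread has
 * read the line -- the core of the directory's conflict check. */
static bool write_conflicts(const dir_entry_t *e, int writer) {
    return (e->reader_set & ~(1ULL << writer)) != 0;
}
```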

  20. Insights on consciousness from taste memory research.

    PubMed

    Gallo, Milagros

    2016-01-01

    Taste research in rodents supports the relevance of memory in determining the content of consciousness, by modifying both taste perception and later action. Related to this issue is the fact that taste and visual modalities share anatomical circuits traditionally related to conscious memory. This challenges the view of taste memory as a type of non-declarative, unconscious memory.

  1. Technology breakthroughs in high performance metal-oxide-semiconductor devices for ultra-high density, low power non-volatile memory applications

    NASA Astrophysics Data System (ADS)

    Hong, Augustin Jinwoo

    Non-volatile memory devices have attracted much attention because data can be retained without power consumption for more than a decade. Non-volatile memory devices are therefore essential to mobile electronic applications. Among state-of-the-art non-volatile memory devices, NAND flash memory has earned the highest attention because of its ultra-high scalability and therefore its ultra-high storage capacity. However, user demand as well as market competition requires not only larger storage capacity but also lower power consumption for longer battery lifetime. One way to meet this demand and extend the benefits of NAND flash memory is to find new materials for the storage layer inside the flash memory, called the floating gate in state-of-the-art flash memory devices. In this dissertation, we study new materials for the floating gate that can lower power consumption and increase storage capacity at the same time. To this end, we employ various materials such as metal nanodots, metal thin films, and graphene, incorporating complementary metal-oxide-semiconductor (CMOS)-compatible processes. Experimental results show excellent memory effects at relatively low operating voltages. The detailed physics and analysis of the experimental results are discussed. These new materials for data storage are promising candidates for future non-volatile memory applications beyond state-of-the-art flash technologies.

  2. A compact superconducting nanowire memory element operated by nanowire cryotrons

    NASA Astrophysics Data System (ADS)

    Zhao, Qing-Yuan; Toomey, Emily A.; Butters, Brenden A.; McCaughan, Adam N.; Dane, Andrew E.; Nam, Sae-Woo; Berggren, Karl K.

    2018-07-01

    A superconducting loop stores persistent current without any ohmic loss, making it an ideal platform for energy-efficient memories. Conventional superconducting memories use an architecture based on Josephson junctions (JJs) and have demonstrated access times less than 10 ps and power dissipation as low as 10^-19 J. However, their scalability has been slow to develop due to the challenges in reducing the dimensions of JJs and minimizing the area of the superconducting loops. In addition to the memory itself, complex readout circuits require additional JJs and inductors for coupling signals, increasing the overall area. Here, we have demonstrated a superconducting memory based solely on lithographic nanowires. The small dimensions of the nanowire ensure that the device can be fabricated in a dense area in multiple layers, while the high kinetic inductance makes the loop essentially independent of geometric inductance, allowing it to be scaled down without sacrificing performance. The memory is operated by a group of nanowire cryotrons patterned alongside the storage loop, enabling us to reduce the entire memory cell to 3 μm × 7 μm in our proof-of-concept device. In this work we present the operation principles of a superconducting nanowire memory (nMem) and characterize its bit error rate, speed, and power dissipation.

  3. An i2b2-based, generalizable, open source, self-scaling chronic disease registry

    PubMed Central

    Quan, Justin; Ortiz, David M; Bousvaros, Athos; Ilowite, Norman T; Inman, Christi J; Marsolo, Keith; McMurry, Andrew J; Sandborg, Christy I; Schanberg, Laura E; Wallace, Carol A; Warren, Robert W; Weber, Griffin M; Mandl, Kenneth D

    2013-01-01

    Objective: Registries are a well-established mechanism for obtaining high quality, disease-specific data, but are often highly project-specific in their design, implementation, and policies for data use. In contrast to the conventional model of centralized data contribution, warehousing, and control, we design a self-scaling registry technology for collaborative data sharing, based upon the widely adopted Informatics for Integrating Biology & the Bedside (i2b2) data warehousing framework and the Shared Health Research Information Network (SHRINE) peer-to-peer networking software. Materials and methods: Focusing our design around creation of a scalable solution for collaboration within multi-site disease registries, we leverage the i2b2 and SHRINE open source software to create a modular, ontology-based, federated infrastructure that provides research investigators full ownership and access to their contributed data while supporting permissioned yet robust data sharing. We accomplish these objectives via web services supporting peer-group overlays, group-aware data aggregation, and administrative functions. Results: The 56-site Childhood Arthritis & Rheumatology Research Alliance (CARRA) Registry and 3-site Harvard Inflammatory Bowel Diseases Longitudinal Data Repository now utilize i2b2 self-scaling registry technology (i2b2-SSR). This platform, extensible to federation of multiple projects within and between research networks, encompasses >6000 subjects at sites throughout the USA. Discussion: We utilize the i2b2-SSR platform to minimize technical barriers to collaboration while enabling fine-grained control over data sharing. Conclusions: The implementation of i2b2-SSR for the multi-site, multi-stakeholder CARRA Registry has established a digital infrastructure for community-driven research data sharing in pediatric rheumatology in the USA. We envision i2b2-SSR as a scalable, reusable solution facilitating interdisciplinary research across diseases. PMID:22733975

  4. An i2b2-based, generalizable, open source, self-scaling chronic disease registry.

    PubMed

    Natter, Marc D; Quan, Justin; Ortiz, David M; Bousvaros, Athos; Ilowite, Norman T; Inman, Christi J; Marsolo, Keith; McMurry, Andrew J; Sandborg, Christy I; Schanberg, Laura E; Wallace, Carol A; Warren, Robert W; Weber, Griffin M; Mandl, Kenneth D

    2013-01-01

    Registries are a well-established mechanism for obtaining high quality, disease-specific data, but are often highly project-specific in their design, implementation, and policies for data use. In contrast to the conventional model of centralized data contribution, warehousing, and control, we design a self-scaling registry technology for collaborative data sharing, based upon the widely adopted Informatics for Integrating Biology & the Bedside (i2b2) data warehousing framework and the Shared Health Research Information Network (SHRINE) peer-to-peer networking software. Focusing our design around creation of a scalable solution for collaboration within multi-site disease registries, we leverage the i2b2 and SHRINE open source software to create a modular, ontology-based, federated infrastructure that provides research investigators full ownership and access to their contributed data while supporting permissioned yet robust data sharing. We accomplish these objectives via web services supporting peer-group overlays, group-aware data aggregation, and administrative functions. The 56-site Childhood Arthritis & Rheumatology Research Alliance (CARRA) Registry and 3-site Harvard Inflammatory Bowel Diseases Longitudinal Data Repository now utilize i2b2 self-scaling registry technology (i2b2-SSR). This platform, extensible to federation of multiple projects within and between research networks, encompasses >6000 subjects at sites throughout the USA. We utilize the i2b2-SSR platform to minimize technical barriers to collaboration while enabling fine-grained control over data sharing. The implementation of i2b2-SSR for the multi-site, multi-stakeholder CARRA Registry has established a digital infrastructure for community-driven research data sharing in pediatric rheumatology in the USA. We envision i2b2-SSR as a scalable, reusable solution facilitating interdisciplinary research across diseases.

  5. Genomically Encoded Analog Memory with Precise In vivo DNA Writing in Living Cell Populations

    PubMed Central

    Farzadfard, Fahim; Lu, Timothy K.

    2014-01-01

    Cellular memory is crucial to many natural biological processes and for sophisticated synthetic-biology applications. Existing cellular memories rely on epigenetic switches or recombinases, which are limited in scalability and recording capacity. Here, we use the DNA of living cell populations as genomic ‘tape recorders’ for the analog and distributed recording of long-term event histories. We describe a platform for generating single-stranded DNA (ssDNA) in vivo in response to arbitrary transcriptional signals. When co-expressed with a recombinase, these intracellularly expressed ssDNAs target specific genomic DNA addresses, resulting in precise mutations that accumulate in cell populations as a function of the magnitude and duration of the inputs. This platform could enable long-term cellular recorders for environmental and biomedical applications, biological state machines, and enhanced genome engineering strategies. PMID:25395541

  6. Investigating Ground Swarm Robotics Using Agent Based Simulation

    DTIC Science & Technology

    2006-12-01

    Incorporation of virtual pheromones as a shared memory map is modeled as an additional capability that is found to enhance the robustness and reliability of the swarm.

  7. Improving the Scalability of an Exact Approach for Frequent Item Set Hiding

    ERIC Educational Resources Information Center

    LaMacchia, Carolyn

    2013-01-01

    Technological advances have led to the generation of large databases of organizational data recognized as an information-rich, strategic asset for internal analysis and sharing with trading partners. Data mining techniques can discover patterns in large databases including relationships considered strategically relevant to the owner of the data.…

  8. Selection and Presentation of Commercially Available Electronic Resources: Issues and Practices.

    ERIC Educational Resources Information Center

    Jewell, Timothy D.

    This report focuses on practices related to the selection and presentation of commercially available electronic resources. As part of the Digital Library Federation's Collection Practices Initiative, the report also shares the goal of identifying and propagating practices that support the growth of sustainable and scalable collections. It looks in…

  9. Gigwa-Genotype investigator for genome-wide analyses.

    PubMed

    Sempéré, Guilhem; Philippe, Florian; Dereeper, Alexis; Ruiz, Manuel; Sarah, Gautier; Larmande, Pierre

    2016-06-06

    Exploring the structure of genomes and analyzing their evolution is essential to understanding the ecological adaptation of organisms. However, with the large amounts of data being produced by next-generation sequencing, computational challenges arise in terms of storage, search, sharing, analysis and visualization. This is particularly true with regards to studies of genomic variation, which are currently lacking scalable and user-friendly data exploration solutions. Here we present Gigwa, a web-based tool that provides an easy and intuitive way to explore large amounts of genotyping data by filtering it not only on the basis of variant features, including functional annotations, but also on genotype patterns. The data storage relies on MongoDB, which offers good scalability properties. Gigwa can handle multiple databases and may be deployed in either single- or multi-user mode. In addition, it provides a wide range of popular export formats. The Gigwa application is suitable for managing large amounts of genomic variation data. Its user-friendly web interface makes such processing widely accessible. It can either be simply deployed on a workstation or be used to provide a shared data portal for a given community of researchers.

  10. Centrally managed unified shared virtual address space

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wilkes, John

    Systems, apparatuses, and methods for managing a unified shared virtual address space. A host may execute system software and manage a plurality of nodes coupled to the host. The host may send work tasks to the nodes, and for each node, the host may externally manage the node's view of the system's virtual address space. Each node may have a central processing unit (CPU) style memory management unit (MMU) with an internal translation lookaside buffer (TLB). In one embodiment, the host may be coupled to a given node via an input/output memory management unit (IOMMU) interface, where the IOMMU front-end interface shares the TLB with the given node's MMU. In another embodiment, the host may control the given node's view of virtual address space via memory-mapped control registers.

  11. Attention and Visuospatial Working Memory Share the Same Processing Resources

    PubMed Central

    Feng, Jing; Pratt, Jay; Spence, Ian

    2012-01-01

    Attention and visuospatial working memory (VWM) share very similar characteristics; both have the same upper bound of about four items in capacity and they recruit overlapping brain regions. We examined whether both attention and VWM share the same processing resources using a novel dual-task costs approach based on a load-varying dual-task technique. With sufficiently large loads on attention and VWM, considerable interference between the two processes was observed. A further load increase on either process produced reciprocal increases in interference on both processes, indicating that attention and VWM share common resources. More critically, comparison among four experiments on the reciprocal interference effects, as measured by the dual-task costs, demonstrates no significant contribution from additional processing other than the shared processes. These results support the notion that attention and VWM share the same processing resources. PMID:22529826

  12. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

    DOE PAGES

    Carter Edwards, H.; Trott, Christian R.; Sunderland, Daniel

    2014-07-22

    The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
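
    As a rough illustration of the unified abstraction described above (a minimal sketch assuming a standard Kokkos installation, not code from the paper), the kernel below leaves the View's memory layout to the execution space, so the same loop body maps to coalesced accesses on a GPU backend and cache-friendly strides on a host backend:

      #include <Kokkos_Core.hpp>
      #include <cstdio>

      int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
          const int N = 1 << 20;
          // A View's default layout is chosen by the execution space
          // (e.g., LayoutLeft on CUDA, LayoutRight on host), decoupling
          // the loop body from the device's preferred access pattern.
          Kokkos::View<double*> x("x", N);
          Kokkos::View<double*> y("y", N);

          Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
            x(i) = 1.0;
            y(i) = 2.0;
          });

          double dot = 0.0;
          Kokkos::parallel_reduce("dot", N,
            KOKKOS_LAMBDA(const int i, double& partial) {
              partial += x(i) * y(i);
            }, dot);
          std::printf("dot = %f\n", dot);
        }
        Kokkos::finalize();
        return 0;
      }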

  13. Storage of multiple single-photon pulses emitted from a quantum dot in a solid-state quantum memory.

    PubMed

    Tang, Jian-Shun; Zhou, Zong-Quan; Wang, Yi-Tao; Li, Yu-Long; Liu, Xiao; Hua, Yi-Lin; Zou, Yang; Wang, Shuang; He, De-Yong; Chen, Geng; Sun, Yong-Nan; Yu, Ying; Li, Mi-Feng; Zha, Guo-Wei; Ni, Hai-Qiao; Niu, Zhi-Chuan; Li, Chuan-Feng; Guo, Guang-Can

    2015-10-15

    Quantum repeaters are critical components for distributing entanglement over long distances in the presence of unavoidable optical losses during transmission. Stimulated by the Duan-Lukin-Cirac-Zoller protocol, many improved quantum repeater protocols based on quantum memories have been proposed, which commonly focus on the entanglement-distribution rate. Among these protocols, the elimination of multiple photons (or multiple photon-pairs) and the use of multimode quantum memory are demonstrated to have the ability to greatly improve the entanglement-distribution rate. Here, we demonstrate the storage of deterministic single photons emitted from a quantum dot in a polarization-maintaining solid-state quantum memory; in addition, multi-temporal-mode memory with 1, 20 and 100 narrow single-photon pulses is also demonstrated. Multi-photons are eliminated, and only one photon at most is contained in each pulse. Moreover, the solid-state properties of both sub-systems make this configuration more stable and easier to scale. Our work will be helpful in the construction of efficient quantum repeaters based on all-solid-state devices.

  14. Solid State Spin-Wave Quantum Memory for Time-Bin Qubits.

    PubMed

    Gündoğan, Mustafa; Ledingham, Patrick M; Kutluer, Kutlu; Mazzera, Margherita; de Riedmatten, Hugues

    2015-06-12

    We demonstrate the first solid-state spin-wave optical quantum memory with on-demand read-out. Using the full atomic frequency comb scheme in a Pr^{3+}:Y_{2}SiO_{5} crystal, we store weak coherent pulses at the single-photon level with a signal-to-noise ratio >10. Narrow-band spectral filtering based on spectral hole burning in a second Pr^{3+}:Y_{2}SiO_{5} crystal is used to filter out the excess noise created by control pulses to reach an unconditional noise level of (2.0±0.3)×10^{-3} photons per pulse. We also report spin-wave storage of photonic time-bin qubits with conditional fidelities higher than achievable by a measure and prepare strategy, demonstrating that the spin-wave memory operates in the quantum regime. This makes our device the first demonstration of a quantum memory for time-bin qubits, with on-demand read-out of the stored quantum information. These results represent an important step for the use of solid-state quantum memories in scalable quantum networks.

  15. Storage of multiple single-photon pulses emitted from a quantum dot in a solid-state quantum memory

    PubMed Central

    Tang, Jian-Shun; Zhou, Zong-Quan; Wang, Yi-Tao; Li, Yu-Long; Liu, Xiao; Hua, Yi-Lin; Zou, Yang; Wang, Shuang; He, De-Yong; Chen, Geng; Sun, Yong-Nan; Yu, Ying; Li, Mi-Feng; Zha, Guo-Wei; Ni, Hai-Qiao; Niu, Zhi-Chuan; Li, Chuan-Feng; Guo, Guang-Can

    2015-01-01

    Quantum repeaters are critical components for distributing entanglement over long distances in the presence of unavoidable optical losses during transmission. Stimulated by the Duan–Lukin–Cirac–Zoller protocol, many improved quantum repeater protocols based on quantum memories have been proposed, which commonly focus on the entanglement-distribution rate. Among these protocols, the elimination of multiple photons (or multiple photon-pairs) and the use of multimode quantum memory are demonstrated to have the ability to greatly improve the entanglement-distribution rate. Here, we demonstrate the storage of deterministic single photons emitted from a quantum dot in a polarization-maintaining solid-state quantum memory; in addition, multi-temporal-mode memory with 1, 20 and 100 narrow single-photon pulses is also demonstrated. Multi-photons are eliminated, and only one photon at most is contained in each pulse. Moreover, the solid-state properties of both sub-systems make this configuration more stable and easier to scale. Our work will be helpful in the construction of efficient quantum repeaters based on all-solid-state devices. PMID:26468996

  16. Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Secchi, Simone; Tumeo, Antonino; Villa, Oreste

    Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has proven to be a promising approach for achieving good accuracy in reasonable time. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.

  17. SiGe epitaxial memory for neuromorphic computing with reproducible high performance based on engineered dislocations

    NASA Astrophysics Data System (ADS)

    Choi, Shinhyun; Tan, Scott H.; Li, Zefan; Kim, Yunjo; Choi, Chanyeol; Chen, Pai-Yu; Yeon, Hanwool; Yu, Shimeng; Kim, Jeehwan

    2018-01-01

    Although several types of architecture combining memory cells and transistors have been used to demonstrate artificial synaptic arrays, they usually present limited scalability and high power consumption. Transistor-free analog switching devices may overcome these limitations, yet the typical switching process they rely on—formation of filaments in an amorphous medium—is not easily controlled and hence hampers the spatial and temporal reproducibility of the performance. Here, we demonstrate analog resistive switching devices that possess desired characteristics for neuromorphic computing networks with minimal performance variations using a single-crystalline SiGe layer epitaxially grown on Si as a switching medium. Such epitaxial random access memories utilize threading dislocations in SiGe to confine metal filaments in a defined, one-dimensional channel. This confinement results in drastically enhanced switching uniformity and long retention/high endurance with a high analog on/off ratio. Simulations using the MNIST handwritten recognition data set prove that epitaxial random access memories can operate with an online learning accuracy of 95.1%.

  18. System and method for memory allocation in a multiclass memory system

    DOEpatents

    Loh, Gabriel; Meswani, Mitesh; Ignatowski, Michael; Nutter, Mark

    2016-06-28

    A system for memory allocation in a multiclass memory system includes a processor coupleable to a plurality of memories sharing a unified memory address space, and a library store to store a library of software functions. The processor identifies a type of a data structure in response to a memory allocation function call to the library for allocating memory to the data structure. Using the library, the processor allocates portions of the data structure among multiple memories of the multiclass memory system based on the type of the data structure.
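
    The patent abstract describes a dispatch mechanism rather than a published API, so the sketch below uses purely hypothetical names (MemClass, StructKind, multiclass_alloc) to illustrate the idea: the allocation call carries the data structure's type, and the library maps that type onto one of the memory classes in the unified address space.

      #include <cstdio>
      #include <cstdlib>

      // Hypothetical memory classes in a unified address space,
      // e.g., fast stacked DRAM vs. large off-package DRAM.
      enum class MemClass { Fast, Bulk };

      // Hypothetical tag describing the data structure being allocated.
      enum class StructKind { HotIndex, ColdArchive };

      void* alloc_from(MemClass mc, size_t bytes) {
        // Stand-in for per-class allocators (e.g., separate NUMA pools).
        (void)mc;
        return std::malloc(bytes);
      }

      // Library entry point: choose the memory class from the structure kind.
      void* multiclass_alloc(StructKind kind, size_t bytes) {
        MemClass mc = (kind == StructKind::HotIndex) ? MemClass::Fast
                                                     : MemClass::Bulk;
        return alloc_from(mc, bytes);
      }

      int main() {
        void* idx = multiclass_alloc(StructKind::HotIndex, 4096);
        std::printf("allocated %p\n", idx);
        std::free(idx);
        return 0;
      }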

  19. Nitrogen-doped partially reduced graphene oxide rewritable nonvolatile memory.

    PubMed

    Seo, Sohyeon; Yoon, Yeoheung; Lee, Junghyun; Park, Younghun; Lee, Hyoyoung

    2013-04-23

    As memory materials, two-dimensional (2D) carbon materials such as graphene oxide (GO)-based materials have attracted attention due to a variety of advantageous attributes, including their solution-processability and their potential for highly scalable device fabrication for transistor-based memory and cross-bar memory arrays. In spite of this, the use of GO-based materials has been limited, primarily due to uncontrollable oxygen functional groups. To induce a stable memory effect via the ionic charges of the negatively charged carboxylic acid groups of partially reduced graphene oxide (PrGO), a positively charged pyridinium N that served as a counterion to the negatively charged carboxylic acid was carefully introduced onto the PrGO framework. Partially reduced N-doped graphene oxide (PrGODMF) in dimethylformamide (DMF) behaved as a semiconducting nonvolatile memory material. Its optical energy band gap was 1.7-2.1 eV, and it contained an sp2 C═C framework with 45-50% oxygen-functionalized carbon density and 3% doped nitrogen atoms. In particular, the rewritable nonvolatile memory characteristics were dependent on the proportion of pyridinium N, and as the proportion of pyridinium N atoms decreased, the PrGODMF film lost its memory behavior. Polarization of charged PrGODMF containing pyridinium N and carboxylic acid under an electric field produced N-doped PrGODMF memory effects that followed voltage-driven rewrite-read-erase-read processes.

  20. A Formal Model of Capacity Limits in Working Memory

    ERIC Educational Resources Information Center

    Oberauer, Klaus; Kliegl, Reinhold

    2006-01-01

    A mathematical model of working-memory capacity limits is proposed on the key assumption of mutual interference between items in working memory. Interference is assumed to arise from overwriting of features shared by these items. The model was fit to time-accuracy data of memory-updating tasks from four experiments using nonlinear mixed effect…

  1. Ordering of guarded and unguarded stores for no-sync I/O

    DOEpatents

    Gara, Alan; Ohmacht, Martin

    2013-06-25

    A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. A third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushes only the first queue.

  2. Explicit time integration of finite element models on a vectorized, concurrent computer with shared memory

    NASA Technical Reports Server (NTRS)

    Gilbertsen, Noreen D.; Belytschko, Ted

    1990-01-01

    The implementation of a nonlinear explicit program on a vectorized, concurrent computer with shared memory is described and studied. The conflict between vectorization and concurrency is described and some guidelines are given for optimal block sizes. Several example problems are summarized to illustrate the types of speed-ups which can be achieved by reprogramming as compared to compiler optimization.

  3. LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kurzak, Jakub; Luszczek, Piotr; Faverge, Mathieu

    2012-03-01

    LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
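
    The abstract does not include the hybrid CPU/GPU code itself; as a reference point, here is a textbook sequential sketch of the canonical procedure being accelerated, LU factorization with partial pivoting (illustrative only):

      #include <cmath>
      #include <utility>
      #include <vector>

      // Textbook in-place LU factorization with partial pivoting (PA = LU).
      // A is n x n in row-major order; piv[k] records the row swapped with row k.
      bool lu_factor(std::vector<double>& A, std::vector<int>& piv, int n) {
        for (int k = 0; k < n; ++k) {
          // Partial pivoting: choose the largest |A(i,k)| on or below the diagonal.
          int p = k;
          for (int i = k + 1; i < n; ++i)
            if (std::fabs(A[i * n + k]) > std::fabs(A[p * n + k])) p = i;
          if (A[p * n + k] == 0.0) return false;  // matrix is singular
          piv[k] = p;
          if (p != k)
            for (int j = 0; j < n; ++j) std::swap(A[k * n + j], A[p * n + j]);
          // Eliminate below the pivot; multipliers overwrite the lower triangle.
          for (int i = k + 1; i < n; ++i) {
            A[i * n + k] /= A[k * n + k];
            for (int j = k + 1; j < n; ++j)
              A[i * n + j] -= A[i * n + k] * A[k * n + j];
          }
        }
        return true;
      }

    In hybrid systems of the kind the article targets, the trailing-matrix updates (the innermost elimination loops) are typically what is offloaded to accelerators, while the latency-sensitive pivot search and row swaps remain on the CPUs.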

  4. Neural Mechanisms of Interference Control Underlie the Relationship between Fluid Intelligence and Working Memory Span

    ERIC Educational Resources Information Center

    Burgess, Gregory C.; Gray, Jeremy R.; Conway, Andrew R. A.; Braver, Todd S.

    2011-01-01

    Fluid intelligence (gF) and working memory (WM) span predict success in demanding cognitive situations. Recent studies show that much of the variance in gF and WM span is shared, suggesting common neural mechanisms. This study provides a direct investigation of the degree to which shared variance in gF and WM span can be explained by neural…

  5. Emulating short-term synaptic dynamics with memristive devices

    NASA Astrophysics Data System (ADS)

    Berdan, Radu; Vasilaki, Eleni; Khiat, Ali; Indiveri, Giacomo; Serb, Alexandru; Prodromakis, Themistoklis

    2016-01-01

    Neuromorphic architectures offer great promise for achieving computation capacities beyond conventional Von Neumann machines. The essential elements for achieving this vision are highly scalable synaptic mimics that do not undermine biological fidelity. Here we demonstrate that single solid-state TiO2 memristors can exhibit non-associative plasticity phenomena observed in biological synapses, supported by their metastable memory state transition properties. We show that, contrary to conventional uses of solid-state memory, the existence of rate-limiting volatility is a key feature for capturing short-term synaptic dynamics. We also show how the temporal dynamics of our prototypes can be exploited to implement spatio-temporal computation, demonstrating the memristors' full potential for building biophysically realistic neural processing systems.

  6. KITTEN Lightweight Kernel 0.1 Beta

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pedretti, Kevin; Levenhagen, Michael; Kelly, Suzanne

    2007-12-12

    The Kitten Lightweight Kernel is a simplified OS (operating system) kernel that is intended to manage a compute node's hardware resources. It provides a set of mechanisms to user-level applications for utilizing hardware resources (e.g., allocating memory, creating processes, accessing the network). Kitten is much simpler than general-purpose OS kernels, such as Linux or Windows, but includes all of the essential functionality needed to support HPC (high-performance computing) MPI, PGAS and OpenMP applications. Kitten provides unique capabilities such as physically contiguous application memory, transparent large page support, and noise-free tick-less operation, which enable HPC applications to obtain greater efficiency and scalability than with general purpose OS kernels.

  7. GraphReduce: Processing Large-Scale Graphs on Accelerator-Based Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sengupta, Dipanjan; Song, Shuaiwen; Agarwal, Kapil

    2015-11-15

    Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device’s internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and device.
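
    GraphReduce's GPU-streamed implementation is not reproduced in the abstract; the toy host-only sketch below (hypothetical names) shows the Gather-Apply-Scatter pattern it builds on, with a PageRank-style update as the vertex program:

      #include <cstddef>
      #include <cstdio>
      #include <vector>

      struct Edge { int src, dst; };

      // One synchronous Gather-Apply-Scatter step over an edge list:
      // gather sums neighbor contributions (edge-centric), apply updates
      // vertex state (vertex-centric); scatter would activate neighbors.
      void gas_step(const std::vector<Edge>& edges,
                    const std::vector<int>& out_degree,
                    std::vector<double>& rank) {
        std::vector<double> gathered(rank.size(), 0.0);
        for (const Edge& e : edges)  // Gather
          gathered[e.dst] += rank[e.src] / out_degree[e.src];
        const double d = 0.85;       // damping factor
        for (std::size_t v = 0; v < rank.size(); ++v)  // Apply
          rank[v] = (1.0 - d) / rank.size() + d * gathered[v];
      }

      int main() {
        std::vector<Edge> edges = {{0, 1}, {1, 2}, {2, 0}, {0, 2}};
        std::vector<int> deg = {2, 1, 1};
        std::vector<double> rank(3, 1.0 / 3.0);
        for (int it = 0; it < 10; ++it) gas_step(edges, deg, rank);
        std::printf("%.3f %.3f %.3f\n", rank[0], rank[1], rank[2]);
        return 0;
      }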

  8. What Multilevel Parallel Programs do when you are not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved that support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation-specific behavior of the operating system (OS) or architectural issues. Rewriting a large scientific application in order to employ a new programming paradigm is usually a time-consuming and error-prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse grained parallelization and OpenMP [9] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP, which was developed by Taft. The approach is similar to MPI/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
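
    A minimal sketch of the hybrid MPI/OpenMP paradigm described above (a generic illustration, not code from the study): MPI decomposes the domain across private address spaces, while an OpenMP directive supplies loop-level parallelism within each process.

      #include <mpi.h>
      #include <cstdio>
      #include <vector>

      int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Coarse grain: each MPI process owns a chunk of the domain.
        const int local_n = 1 << 20;
        std::vector<double> u(local_n, rank + 1.0);

        // Fine grain: loop-level OpenMP parallelism within the process.
        double local_sum = 0.0;
        #pragma omp parallel for reduction(+ : local_sum)
        for (int i = 0; i < local_n; ++i)
          local_sum += u[i] * u[i];

        // Explicit message passing combines the per-process results.
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0) std::printf("global sum = %f\n", global_sum);
        MPI_Finalize();
        return 0;
      }

    Built with an MPI compiler wrapper and OpenMP enabled (e.g., mpicxx -fopenmp), each rank runs multiple threads over its local chunk.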

  9. Understanding Graphics on a Scalable Latching Assistive Haptic Display Using a Shape Memory Polymer Membrane.

    PubMed

    Besse, Nadine; Rosset, Samuel; Zarate, Juan Jose; Ferrari, Elisabetta; Brayda, Luca; Shea, Herbert

    2018-01-01

    We present a fully latching and scalable 4 × 4 haptic display with 4 mm pitch, 5 s refresh time, 400 mN holding force, and 650 μm displacement per taxel. The display serves to convey dynamic graphical information to blind and visually impaired users. Combining significant holding force with high taxel density and large amplitude motion in a very compact overall form factor was made possible by exploiting the reversible, fast, hundred-fold change in the stiffness of a thin shape memory polymer (SMP) membrane when heated above its glass transition temperature. Local heating is produced using an addressable array of stretchable microheaters, 3 mm in diameter, patterned on the SMP. Each taxel is selectively and independently actuated by synchronizing the local Joule heating with a single pressure supply. Switching off the heating locks each taxel into its position (up or down), enabling holding any array configuration with zero power consumption. A 3D-printed pin array is mounted over the SMP membrane, providing the user with a smooth and room-temperature array of movable pins to explore by touch. Perception tests were carried out with 24 blind users, resulting in 70 percent correct pattern recognition over a 12-word tactile dictionary.

  10. Audience-tuning effects on memory: the role of shared reality.

    PubMed

    Echterhoff, Gerald; Higgins, E Tory; Groll, Stephan

    2005-09-01

    After tuning to an audience, communicators' own memories for the topic often reflect the biased view expressed in their messages. Three studies examined explanations for this bias. Memories for a target person were biased when feedback signaled the audience's successful identification of the target but not after failed identification (Experiment 1). Whereas communicators tuning to an in-group audience exhibited the bias, communicators tuning to an out-group audience did not (Experiment 2). These differences did not depend on communicators' mood but were mediated by communicators' trust in their audience's judgment about other people (Experiments 2 and 3). Message and memory were more closely associated for high than for low trusters. Apparently, audience-tuning effects depend on the communicators' experience of a shared reality.

  11. Rapid solution of large-scale systems of equations

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.

    1994-01-01

    The analysis and design of complex aerospace structures requires the rapid solution of large systems of linear and nonlinear equations, eigenvalue extraction for buckling, vibration and flutter modes, structural optimization and design sensitivity calculation. Computers with multiple processors and vector capabilities can offer substantial computational advantages over traditional scalar computers for these analyses. These computers fall into two categories: shared memory computers and distributed memory computers. This presentation covers general-purpose, highly efficient algorithms for generation/assembly of element matrices, solution of systems of linear and nonlinear equations, eigenvalue and design sensitivity analysis, and optimization. All algorithms are coded in FORTRAN for shared memory computers and many are adapted to distributed memory computers. The capability and numerical performance of these algorithms will be addressed.

  12. Simplified Parallel Domain Traversal

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Erickson III, David J

    2011-01-01

    Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale. In order to deliver both simplicity to users as well as scalability on HPC platforms, we introduce a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads. We also integrate our design with advanced parallel I/O techniques that operate directly on native simulation output. We demonstrate DStep by performing teleconnection analysis across ensemble runs of terascale atmospheric CO_{2} and climate data, and we show scalability results on up to 65,536 IBM BlueGene/P cores.

  13. Examining age-related shared variance between face cognition, vision, and self-reported physical health: a test of the common cause hypothesis for social cognition

    PubMed Central

    Olderbak, Sally; Hildebrandt, Andrea; Wilhelm, Oliver

    2015-01-01

    The shared decline in cognitive abilities, sensory functions (e.g., vision and hearing), and physical health with increasing age is well documented with some research attributing this shared age-related decline to a single common cause (e.g., aging brain). We evaluate the extent to which the common cause hypothesis predicts associations between vision and physical health with social cognition abilities specifically face perception and face memory. Based on a sample of 443 adults (17–88 years old), we test a series of structural equation models, including Multiple Indicator Multiple Cause (MIMIC) models, and estimate the extent to which vision and self-reported physical health are related to face perception and face memory through a common factor, before and after controlling for their fluid cognitive component and the linear effects of age. Results suggest significant shared variance amongst these constructs, with a common factor explaining some, but not all, of the shared age-related variance. Also, we found that the relations of face perception, but not face memory, with vision and physical health could be completely explained by fluid cognition. Overall, results suggest that a single common cause explains most, but not all age-related shared variance with domain specific aging mechanisms evident. PMID:26321998

  14. Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamics Codes

    NASA Technical Reports Server (NTRS)

    Yan, Jerry; Jin, Haoqiang; Frumkin, Michael; Yan, Jerry (Technical Monitor)

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. With the great progress made in hardware and software technologies, the performance of parallel programs with compiler directives has demonstrated large improvements. The introduction of OpenMP directives, the industry standard for shared-memory programming, has minimized the issue of portability. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline techniques used in the implementation of the tool and discuss the application of this tool to the NAS Parallel Benchmarks and several computational fluid dynamics codes. This work demonstrates the great potential of using the tool to quickly port codes to parallel execution and to achieve good performance that exceeds that of some commercial tools.
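
    The flavor of such a tool's output can be suggested with a hand-written example (not actual CAPTools output): a serial loop whose iterations are provably independent receives a loop-level OpenMP directive.

      #include <vector>

      // Serial stencil sweep; a directive-generation tool that proves the
      // iterations independent would insert the pragma below. The names
      // 'a', 'b', and 'n' are illustrative, not taken from the benchmarks.
      void sweep(std::vector<double>& a, const std::vector<double>& b, int n) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i)
          a[i] = 0.5 * (b[i - 1] + b[i + 1]);
      }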

  15. Experimental Evaluation of the Value Added by Raising a Reader and Supplemental Parent Training in Shared Reading

    ERIC Educational Resources Information Center

    Anthony, Jason L.; Williams, Jeffrey M.; Zhang, Zhoe; Landry, Susan H.; Dunkelberger, Martha J.

    2014-01-01

    Research Findings: In an effort toward developing a comprehensive, effective, scalable, and sustainable early childhood education program for at-risk populations, we conducted an experimental evaluation of the value added by 2 family involvement programs to the Texas Early Education Model (TEEM). A total of 91 preschool classrooms that served…

  16. Storing a single photon as a spin wave entangled with a flying photon in the telecommunication bandwidth

    NASA Astrophysics Data System (ADS)

    Zhang, Wei; Ding, Dong-Sheng; Shi, Shuai; Li, Yan; Zhou, Zhi-Yuan; Shi, Bao-Sen; Guo, Guang-Can

    2016-02-01

    Quantum memory is an essential building block for quantum communication and scalable linear quantum computation. Storing two-color entangled photons with one photon being at the telecommunication (telecom) wavelength while the other photon is compatible with quantum memory has great advantages toward the realization of the fiber-based long-distance quantum communication with the aid of quantum repeaters. Here, we report an experimental realization of storing a photon entangled with a telecom photon in polarization as an atomic spin wave in a cold atomic ensemble, thus establishing the entanglement between the telecom-band photon and the atomic-ensemble memory in a polarization degree of freedom. The reconstructed density matrix and the violation of the Clauser-Horne-Shimony-Holt inequality clearly show the preservation of quantum entanglement during storage. Our result is very promising for establishing a long-distance quantum network based on cold atomic ensembles.

  17. Binary Associative Memories as a Benchmark for Spiking Neuromorphic Hardware

    PubMed Central

    Stöckel, Andreas; Jenzen, Christoph; Thies, Michael; Rückert, Ulrich

    2017-01-01

    Large-scale neuromorphic hardware platforms, specialized computer systems for energy efficient simulation of spiking neural networks, are being developed around the world, for example as part of the European Human Brain Project (HBP). Due to conceptual differences, a universal performance analysis of these systems in terms of runtime, accuracy and energy efficiency is non-trivial, yet indispensable for further hard- and software development. In this paper we describe a scalable benchmark based on a spiking neural network implementation of the binary neural associative memory. We treat neuromorphic hardware and software simulators as black-boxes and execute exactly the same network description across all devices. Experiments on the HBP platforms under varying configurations of the associative memory show that the presented method allows to test the quality of the neuron model implementation, and to explain significant deviations from the expected reference output. PMID:28878642

  18. Generation of Light with Multimode Time-Delayed Entanglement Using Storage in a Solid-State Spin-Wave Quantum Memory.

    PubMed

    Ferguson, Kate R; Beavan, Sarah E; Longdell, Jevon J; Sellars, Matthew J

    2016-07-08

    Here, we demonstrate generating and storing entanglement in a solid-state spin-wave quantum memory with on-demand readout using the process of rephased amplified spontaneous emission (RASE). Amplified spontaneous emission (ASE), resulting from an inverted ensemble of Pr^{3+} ions doped into a Y_{2}SiO_{5} crystal, generates entanglement between collective states of the praseodymium ensemble and the output light. The ensemble is then rephased using a four-level photon echo technique. Entanglement between the ASE and its echo is confirmed and the inseparability violation preserved when the RASE is stored as a spin wave for up to 5  μs. RASE is shown to be temporally multimode with almost perfect distinguishability between two temporal modes demonstrated. These results pave the way for the use of multimode solid-state quantum memories in scalable quantum networks.

  19. Synthetic biology. Genomically encoded analog memory with precise in vivo DNA writing in living cell populations.

    PubMed

    Farzadfard, Fahim; Lu, Timothy K

    2014-11-14

    Cellular memory is crucial to many natural biological processes and sophisticated synthetic biology applications. Existing cellular memories rely on epigenetic switches or recombinases, which are limited in scalability and recording capacity. In this work, we use the DNA of living cell populations as genomic "tape recorders" for the analog and distributed recording of long-term event histories. We describe a platform for generating single-stranded DNA (ssDNA) in vivo in response to arbitrary transcriptional signals. When coexpressed with a recombinase, these intracellularly expressed ssDNAs target specific genomic DNA addresses, resulting in precise mutations that accumulate in cell populations as a function of the magnitude and duration of the inputs. This platform could enable long-term cellular recorders for environmental and biomedical applications, biological state machines, and enhanced genome engineering strategies. Copyright © 2014, American Association for the Advancement of Science.

  20. The structural approach to shared knowledge: an application to engineering design teams.

    PubMed

    Avnet, Mark S; Weigel, Annalisa L

    2013-06-01

    We propose a methodology for analyzing shared knowledge in engineering design teams. Whereas prior work has focused on shared knowledge in small teams at a specific point in time, the model presented here is both scalable and dynamic. By quantifying team members' common views of design drivers, we build a network of shared mental models to reveal the structure of shared knowledge at a snapshot in time. Based on a structural comparison of networks at different points in time, a metric of change in shared knowledge is computed. Analysis of survey data from 12 conceptual space mission design sessions reveals a correlation between change in shared knowledge and each of several system attributes, including system development time, system mass, and technological maturity. From these results, we conclude that an early period of learning and consensus building could be beneficial to the design of engineered systems. Although we do not examine team performance directly, we demonstrate that shared knowledge is related to the technical design and thus provide a foundation for improving design products by incorporating the knowledge and thoughts of the engineering design team into the process.

  1. Nonlinear secret image sharing scheme.

    PubMed

    Shin, Sang-Ho; Lee, Gil-Je; Yoo, Kee-Young

    2014-01-01

    Over the past decade, most secret image sharing schemes have been proposed using Shamir's technique, which is based on linear-combination polynomial arithmetic. Although Shamir-based secret image sharing schemes are efficient and scalable for various environments, there exists a security threat such as the Tompa-Woll attack. Renvall and Ding proposed a new secret sharing technique based on nonlinear-combination polynomial arithmetic to address this threat, but it is hard to apply to secret image sharing. In this paper, we propose a (t, n)-threshold nonlinear secret image sharing scheme with a steganography concept. In order to achieve a suitable and secure secret image sharing scheme, we adapt a modified LSB embedding technique with an XOR Boolean algebra operation, define a new variable m, and change the range of the prime p in the sharing procedure. In order to evaluate the efficiency and security of the proposed scheme, we use the embedding capacity and PSNR. The average PSNR and embedding capacity are 44.78 dB and 1.74t⌈log2 m⌉ bits per pixel (bpp), respectively.
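
    For contrast with the nonlinear scheme proposed here, the sketch below is a toy implementation of the classic linear Shamir (t, n) technique that the abstract refers to (a small prime field for readability; real schemes use larger fields and, for images, operate on pixel values):

      #include <cstdint>
      #include <cstdio>
      #include <utility>
      #include <vector>

      // Toy Shamir (t, n) secret sharing over GF(P), P prime.
      const int64_t P = 257;  // small prime for illustration only

      int64_t pow_mod(int64_t b, int64_t e, int64_t m) {
        int64_t r = 1; b %= m;
        for (; e > 0; e >>= 1, b = b * b % m)
          if (e & 1) r = r * b % m;
        return r;
      }
      int64_t inv_mod(int64_t a) { return pow_mod(a, P - 2, P); }  // Fermat

      // Share: evaluate a degree t-1 polynomial whose constant term is the secret.
      std::vector<int64_t> make_shares(int64_t secret,
                                       const std::vector<int64_t>& coeffs, int n) {
        std::vector<int64_t> shares(n);
        for (int x = 1; x <= n; ++x) {
          int64_t y = secret, xp = 1;
          for (int64_t c : coeffs) { xp = xp * x % P; y = (y + c * xp) % P; }
          shares[x - 1] = y;
        }
        return shares;
      }

      // Reconstruct from t points (x_i, y_i) by Lagrange interpolation at 0.
      int64_t reconstruct(const std::vector<std::pair<int64_t, int64_t>>& pts) {
        int64_t s = 0;
        for (std::size_t i = 0; i < pts.size(); ++i) {
          int64_t num = 1, den = 1;
          for (std::size_t j = 0; j < pts.size(); ++j) {
            if (i == j) continue;
            num = num * ((P - pts[j].first) % P) % P;                  // (0 - x_j)
            den = den * (((pts[i].first - pts[j].first) % P + P) % P) % P;
          }
          s = (s + pts[i].second * num % P * inv_mod(den)) % P;
        }
        return s;
      }

      int main() {
        auto shares = make_shares(123, {42, 7}, 5);  // t = 3, n = 5
        int64_t rec =
            reconstruct({{1, shares[0]}, {2, shares[1]}, {3, shares[2]}});
        std::printf("reconstructed secret = %lld\n", (long long)rec);
        return 0;
      }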

  2. Nonlinear Secret Image Sharing Scheme

    PubMed Central

    Shin, Sang-Ho; Yoo, Kee-Young

    2014-01-01

    Over the past decade, most secret image sharing schemes have been proposed using Shamir's technique, which is based on linear-combination polynomial arithmetic. Although Shamir-based secret image sharing schemes are efficient and scalable for various environments, there exists a security threat such as the Tompa-Woll attack. Renvall and Ding proposed a new secret sharing technique based on nonlinear-combination polynomial arithmetic to address this threat, but it is hard to apply to secret image sharing. In this paper, we propose a (t, n)-threshold nonlinear secret image sharing scheme with a steganography concept. In order to achieve a suitable and secure secret image sharing scheme, we adapt a modified LSB embedding technique with an XOR Boolean algebra operation, define a new variable m, and change the range of the prime p in the sharing procedure. In order to evaluate the efficiency and security of the proposed scheme, we use the embedding capacity and PSNR. The average PSNR and embedding capacity are 44.78 dB and 1.74t⌈log2 m⌉ bits per pixel (bpp), respectively. PMID:25140334

  3. Scheduling for Locality in Shared-Memory Multiprocessors

    DTIC Science & Technology

    1993-05-01

    Doctoral dissertation. ... architecture on parallel program performance, explain the implications of this trend on popular parallel programming models, and propose system software to ... decomposition and scheduling algorithms. Subject terms: shared-memory multiprocessors; architecture trends; loop scheduling. 110 pages.

  4. Advanced Development of Certified OS Kernels

    DTIC Science & Technology

    2015-06-01

    It provides an infrastructure to map a physical page into multiple processes' page maps in different address spaces. Their ownership mechanism ensures ... of their shared memory infrastructure. The trap module specifies the behaviors of exception handlers and mCertiKOS system calls. ... 1 pm for the shared memory infrastructure (3 layers), 3.5 pm for the thread management (10 layers), 1 pm for the process management (4 layers).

  5. 6 DOF Nonlinear AUV Simulation Toolbox

    DTIC Science & Technology

    1997-01-01

    ... is to supply a flexible 3D-simulation platform for motion visualization, in-lab debugging and testing of mission-specific strategies as well as those ... Explorer are modularly designed [Smith] in order to cut time and cost for vehicle reconfiguration. A flexible 3D-simulation platform is desired to ... 3D models. Currently implemented modules include a nonlinear dynamic model for the OEX, shared memory and semaphore manager tools, and a shared memory monitor.

  6. A cache-aided multiprocessor rollback recovery scheme

    NASA Technical Reports Server (NTRS)

    Wu, Kun-Lung; Fuchs, W. Kent

    1989-01-01

    This paper demonstrates how previous uniprocessor cache-aided recovery schemes can be applied to multiprocessor architectures, for recovering from transient processor failures, utilizing private caches and a global shared memory. As with cache-aided uniprocessor recovery, the multiprocessor cache-aided recovery scheme of this paper can be easily integrated into standard bus-based snoopy cache coherence protocols. A consistent shared memory state is maintained without the necessity of global check-pointing.

  7. Targeted Memory Reactivation during Sleep Adaptively Promotes the Strengthening or Weakening of Overlapping Memories.

    PubMed

    Oyarzún, Javiera P; Morís, Joaquín; Luque, David; de Diego-Balaguer, Ruth; Fuentemilla, Lluís

    2017-08-09

    System memory consolidation is conceptualized as an active process whereby newly encoded memory representations are strengthened through selective memory reactivation during sleep. However, our learning experience is highly overlapping in content (i.e., shares common elements), and memories of these events are organized in an intricate network of overlapping associated events. It remains to be explored whether and how selective memory reactivation during sleep has an impact on these overlapping memories acquired during awake time. Here, we test in a group of adult women and men the prediction that selective memory reactivation during sleep entails the reactivation of associated events and that this may lead the brain to adaptively regulate whether these associated memories are strengthened or pruned from memory networks on the basis of their relative associative strength with the shared element. Our findings demonstrate the existence of efficient regulatory neural mechanisms governing how complex memory networks are shaped during sleep as a function of their associative memory strength. SIGNIFICANCE STATEMENT Numerous studies have demonstrated that system memory consolidation is an active, selective, and sleep-dependent process in which only subsets of new memories become stabilized through their reactivation. However, the learning experience is highly overlapping in content and thus events are encoded in an intricate network of related memories. It remains to be explored whether and how memory reactivation has an impact on overlapping memories acquired during awake time. Here, we show that sleep memory reactivation promotes strengthening and weakening of overlapping memories based on their associative memory strength. These results suggest the existence of an efficient regulatory neural mechanism that avoids the formation of cluttered memory representation of multiple events and promotes stabilization of complex memory networks. Copyright © 2017 the authors 0270-6474/17/377748-11$15.00/0.

  8. Autobiographical memory functions of nostalgia in comparison to rumination and counterfactual thinking: similarity and uniqueness.

    PubMed

    Cheung, Wing-Yee; Wildschut, Tim; Sedikides, Constantine

    2018-02-01

    We compared and contrasted nostalgia with rumination and counterfactual thinking in terms of their autobiographical memory functions. Specifically, we assessed individual differences in nostalgia, rumination, and counterfactual thinking, which we then linked to self-reported functions or uses of autobiographical memory (Self-Regard, Boredom Reduction, Death Preparation, Intimacy Maintenance, Conversation, Teach/Inform, and Bitterness Revival). We tested which memory functions are shared and which are uniquely linked to nostalgia. The commonality among nostalgia, rumination, and counterfactual thinking resides in their shared positive associations with all memory functions: individuals who evinced a stronger propensity towards past-oriented thought (as manifested in nostalgia, rumination, and counterfactual thinking) reported greater overall recruitment of memories in the service of present functioning. The uniqueness of nostalgia resides in its comparatively strong positive associations with Intimacy Maintenance, Teach/Inform, and Self-Regard and weak association with Bitterness Revival. In all, nostalgia possesses a more positive functional signature than do rumination and counterfactual thinking.

  9. Mnemonic convergence in social networks: The emergent properties of cognition at a collective level.

    PubMed

    Coman, Alin; Momennejad, Ida; Drach, Rae D; Geana, Andra

    2016-07-19

    The development of shared memories, beliefs, and norms is a fundamental characteristic of human communities. These emergent outcomes are thought to occur owing to a dynamic system of information sharing and memory updating, which fundamentally depends on communication. Here we report results on the formation of collective memories in laboratory-created communities. We manipulated conversational network structure in a series of real-time, computer-mediated interactions in fourteen 10-member communities. The results show that mnemonic convergence, measured as the degree of overlap among community members' memories, is influenced by both individual-level information-processing phenomena and by the conversational social network structure created during conversational recall. By studying laboratory-created social networks, we show how large-scale social phenomena (i.e., collective memory) can emerge out of microlevel local dynamics (i.e., mnemonic reinforcement and suppression effects). The social-interactionist approach proposed herein points to optimal strategies for spreading information in social networks and provides a framework for measuring and forging collective memories in communities of individuals.

  10. Radiologic image communication and archive service: a secure, scalable, shared approach

    NASA Astrophysics Data System (ADS)

    Fellingham, Linda L.; Kohli, Jagdish C.

    1995-11-01

    The Radiologic Image Communication and Archive (RICA) service is designed to provide a shared archive for medical images to the widest possible audience of customers. Images are acquired from a number of different modalities, each available from many different vendors. Images are acquired digitally from those modalities which support direct digital output and by digitizing films for projection x-ray exams. The RICA Central Archive receives standard DICOM 3.0 messages and data streams from the medical imaging devices at customer institutions over the public telecommunication network. RICA represents a completely scalable resource. The user pays only for what he is using today with the full assurance that as the volume of image data that he wishes to send to the archive increases, the capacity will be there to accept it. To provide this seamless scalability imposes several requirements on the RICA architecture: (1) RICA must support the full array of transport services. (2) The Archive Interface must scale cost-effectively to support local networks that range from the very small (one x-ray digitizer in a medical clinic) to the very large and complex (a large hospital with several CTs, MRs, Nuclear medicine devices, ultrasound machines, CRs, and x-ray digitizers). (3) The Archive Server must scale cost-effectively to support rapidly increasing demands for service providing storage for and access to millions of patients and hundreds of millions of images. The architecture must support the incorporation of improved technology as it becomes available to maintain performance and remain cost-effective as demand rises.

  11. Using memories to understand others: the role of episodic memory in theory of mind impairment in Alzheimer disease.

    PubMed

    Moreau, Noémie; Viallet, François; Champagne-Lavau, Maud

    2013-09-01

    Theory of mind (TOM) refers to the ability to infer one's own and others' mental states. Growing evidence has highlighted impairment on the most complex TOM tasks in Alzheimer disease (AD). However, how the TOM deficit relates to other cognitive dysfunctions, and more specifically to episodic memory impairment - the prominent feature of this disease - is still under debate. Recent neuroanatomical findings have shown that remembering past events and inferring others' states of mind share the same cerebral network, suggesting that the two abilities share a common process. This paper reviews emergent evidence of TOM impairment in AD patients and discusses the evidence of a relationship between TOM and episodic memory. We discuss whether AD patients' deficit in TOM may be related to their difficulties in recollecting memories of past social interactions. Copyright © 2013 Elsevier B.V. All rights reserved.

  12. Mental time travel and the shaping of the human mind

    PubMed Central

    Suddendorf, Thomas; Addis, Donna Rose; Corballis, Michael C.

    2009-01-01

    Episodic memory, enabling conscious recollection of past episodes, can be distinguished from semantic memory, which stores enduring facts about the world. Episodic memory shares a core neural network with the simulation of future episodes, enabling mental time travel into both the past and the future. The notion that there might be something distinctly human about mental time travel has provoked ingenious attempts to demonstrate episodic memory or future simulation in non-human animals, but we argue that they have not yet established a capacity comparable to the human faculty. The evolution of the capacity to simulate possible future events, based on episodic memory, enhanced fitness by enabling action in preparation of different possible scenarios that increased present or future survival and reproduction chances. Human language may have evolved in the first instance for the sharing of past and planned future events, and, indeed, fictional ones, further enhancing fitness in social settings. PMID:19528013

  13. Performance Management of High Performance Computing for Medical Image Processing in Amazon Web Services.

    PubMed

    Bao, Shunxing; Damon, Stephen M; Landman, Bennett A; Gokhale, Aniruddha

    2016-02-27

    Adopting high performance cloud computing for medical image processing is a popular trend given the pressing needs of large studies. Amazon Web Services (AWS) provide reliable, on-demand, and inexpensive cloud computing services. Our research objective is to implement an affordable, scalable and easy-to-use AWS framework for the Java Image Science Toolkit (JIST). JIST is a plugin for Medical-Image Processing, Analysis, and Visualization (MIPAV) that provides a graphical pipeline implementation allowing users to quickly test and develop pipelines. JIST is DRMAA-compliant allowing it to run on portable batch system grids. However, as new processing methods are implemented and developed, memory may often be a bottleneck for not only lab computers, but also possibly some local grids. Integrating JIST with the AWS cloud alleviates these possible restrictions and does not require users to have deep knowledge of programming in Java. Workflow definition/management and cloud configurations are two key challenges in this research. Using a simple unified control panel, users have the ability to set the numbers of nodes and select from a variety of pre-configured AWS EC2 nodes with different numbers of processors and memory storage. Intuitively, we configured Amazon S3 storage to be mounted by pay-for-use Amazon EC2 instances. Hence, S3 storage is recognized as a shared cloud resource. The Amazon EC2 instances provide pre-installs of all necessary packages to run JIST. This work presents an implementation that facilitates the integration of JIST with AWS. We describe the theoretical cost/benefit formulae to decide between local serial execution versus cloud computing and apply this analysis to an empirical diffusion tensor imaging pipeline.
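
    A back-of-the-envelope version of such a cost/benefit comparison can be written down directly. The sketch below is a hypothetical model, not the paper's formulae: it assumes a job of n independent tasks with a measured per-task serial runtime, a fixed startup/transfer overhead for the cloud run, and a single hourly instance price, and simply reports the wall-clock time and dollar cost of each option.

      // Hypothetical cloud-vs-local break-even sketch (not the paper's model).
      #include <cstdio>

      struct Plan { double hours; double dollars; };

      // n tasks, each taking taskHours when run serially.
      Plan local(int n, double taskHours) {
          return { n * taskHours, 0.0 };          // local execution: slow but free
      }

      // The cloud run spreads the tasks over `instances` EC2 nodes and pays a
      // startup/transfer overhead plus an hourly price per instance.
      Plan cloud(int n, double taskHours, int instances,
                 double overheadHours, double pricePerInstHour) {
          double hours = overheadHours + (n * taskHours) / instances;
          return { hours, instances * hours * pricePerInstHour };
      }

      int main() {
          Plan l = local(500, 0.25);                 // 500 tasks, 15 minutes each
          Plan c = cloud(500, 0.25, 32, 0.5, 0.10);  // 32 nodes at $0.10/hour
          std::printf("local: %6.1f h, $%6.2f\n", l.hours, l.dollars);
          std::printf("cloud: %6.1f h, $%6.2f\n", c.hours, c.dollars);
          // Choose the cloud when the hours saved justify the dollar cost.
          return 0;
      }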

  14. Performance management of high performance computing for medical image processing in Amazon Web Services

    NASA Astrophysics Data System (ADS)

    Bao, Shunxing; Damon, Stephen M.; Landman, Bennett A.; Gokhale, Aniruddha

    2016-03-01

    Adopting high performance cloud computing for medical image processing is a popular trend given the pressing needs of large studies. Amazon Web Services (AWS) provide reliable, on-demand, and inexpensive cloud computing services. Our research objective is to implement an affordable, scalable and easy-to-use AWS framework for the Java Image Science Toolkit (JIST). JIST is a plugin for Medical-Image Processing, Analysis, and Visualization (MIPAV) that provides a graphical pipeline implementation allowing users to quickly test and develop pipelines. JIST is DRMAA-compliant allowing it to run on portable batch system grids. However, as new processing methods are implemented and developed, memory may often be a bottleneck for not only lab computers, but also possibly some local grids. Integrating JIST with the AWS cloud alleviates these possible restrictions and does not require users to have deep knowledge of programming in Java. Workflow definition/management and cloud configurations are two key challenges in this research. Using a simple unified control panel, users have the ability to set the numbers of nodes and select from a variety of pre-configured AWS EC2 nodes with different numbers of processors and memory storage. Intuitively, we configured Amazon S3 storage to be mounted by pay-for-use Amazon EC2 instances. Hence, S3 storage is recognized as a shared cloud resource. The Amazon EC2 instances provide pre-installs of all necessary packages to run JIST. This work presents an implementation that facilitates the integration of JIST with AWS. We describe the theoretical cost/benefit formulae to decide between local serial execution versus cloud computing and apply this analysis to an empirical diffusion tensor imaging pipeline.

  15. Performance Management of High Performance Computing for Medical Image Processing in Amazon Web Services

    PubMed Central

    Bao, Shunxing; Damon, Stephen M.; Landman, Bennett A.; Gokhale, Aniruddha

    2016-01-01

    Adopting high performance cloud computing for medical image processing is a popular trend given the pressing needs of large studies. Amazon Web Services (AWS) provide reliable, on-demand, and inexpensive cloud computing services. Our research objective is to implement an affordable, scalable and easy-to-use AWS framework for the Java Image Science Toolkit (JIST). JIST is a plugin for Medical-Image Processing, Analysis, and Visualization (MIPAV) that provides a graphical pipeline implementation allowing users to quickly test and develop pipelines. JIST is DRMAA-compliant allowing it to run on portable batch system grids. However, as new processing methods are implemented and developed, memory may often be a bottleneck for not only lab computers, but also possibly some local grids. Integrating JIST with the AWS cloud alleviates these possible restrictions and does not require users to have deep knowledge of programming in Java. Workflow definition/management and cloud configurations are two key challenges in this research. Using a simple unified control panel, users have the ability to set the numbers of nodes and select from a variety of pre-configured AWS EC2 nodes with different numbers of processors and memory storage. Intuitively, we configured Amazon S3 storage to be mounted by pay-for-use Amazon EC2 instances. Hence, S3 storage is recognized as a shared cloud resource. The Amazon EC2 instances provide pre-installs of all necessary packages to run JIST. This work presents an implementation that facilitates the integration of JIST with AWS. We describe the theoretical cost/benefit formulae to decide between local serial execution versus cloud computing and apply this analysis to an empirical diffusion tensor imaging pipeline. PMID:27127335

  16. Study the effect of reservoir spatial heterogeneity on CO2 sequestration under an uncertainty quantification (UQ) software framework

    NASA Astrophysics Data System (ADS)

    Fang, Y.; Hou, J.; Engel, D.; Lin, G.; Yin, J.; Han, B.; Fang, Z.; Fountoulakis, V.

    2011-12-01

    In this study, we introduce an uncertainty quantification (UQ) software framework for carbon sequestration, focusing on the effect of spatial heterogeneity of reservoir properties on CO2 migration. We use a sequential Gaussian simulation method (SGSIM) to generate realizations of permeability fields with various spatial statistical attributes. To deal with the computational difficulties, we integrate the following ideas/approaches: 1) we use three different sampling approaches (probabilistic collocation, quasi-Monte Carlo, and adaptive sampling) to reduce the required forward calculations while exploring the parameter space and quantifying the input uncertainty; 2) we use eSTOMP as the forward modeling simulator. eSTOMP is implemented using the Global Arrays toolkit (GA), which is based on one-sided inter-processor communication and supports a shared memory programming style on distributed memory platforms, providing highly scalable performance. It uses a data model to partition most of the large-scale data structures into a relatively small number of distinct classes. The lower-level simulator infrastructure (e.g., meshing support, associated data structures, and data mapping to processors) is separated from the higher-level physics and chemistry algorithmic routines using a grid component interface; and 3) besides the faster model and more efficient algorithms to speed up the forward calculation, we built an adaptive system infrastructure to select the best possible data transfer mechanisms, to optimally allocate system resources to improve performance, and to integrate software packages and data for composing carbon sequestration simulation, computation, analysis, estimation and visualization. We demonstrate the framework with a given CO2 injection scenario in a heterogeneous sandstone reservoir.
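
    Of the sampling approaches listed above, quasi-Monte Carlo is the simplest to illustrate compactly. The sketch below generates Halton low-discrepancy samples for two hypothetical reservoir parameters; it is a generic quasi-Monte Carlo illustration, not the framework's actual sampler.

      // Minimal Halton-sequence sampler, a common quasi-Monte Carlo choice
      // (generic illustration, not the UQ framework's sampler).
      #include <cstdio>

      // Radical inverse of index i in the given prime base: the i-th Halton
      // coordinate, filling [0,1) with low discrepancy.
      double halton(int i, int base) {
          double f = 1.0, r = 0.0;
          while (i > 0) {
              f /= base;
              r += f * (i % base);
              i /= base;
          }
          return r;
      }

      int main() {
          // Two hypothetical reservoir parameters mapped from [0,1), e.g.
          // log-permeability mean and correlation length.
          for (int i = 1; i <= 8; ++i)
              std::printf("sample %d: (%.4f, %.4f)\n",
                          i, halton(i, 2), halton(i, 3));   // bases 2 and 3
          return 0;
      }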

  17. A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hoemmen, Mark

    2010-11-01

    Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
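
    The heart of TSQR is a small reduction step: factor the row blocks of the tall matrix independently, stack the resulting small R factors, and factor the stack again, so only n-by-n triangles ever need to be communicated. The sketch below shows a single two-block reduction using Eigen (an assumed dependency); the implementation described above adds the full reduction tree, parallelism, and the implicit representation of Q.

      // Illustration of the TSQR reduction idea (not the Trilinos code):
      // the R factor of a tall matrix from the R factors of its row blocks.
      #include <Eigen/Dense>
      #include <iostream>
      using Eigen::MatrixXd;

      // Extract the n x n upper-triangular R from a Householder QR.
      MatrixXd rfactor(const MatrixXd& A) {
          Eigen::HouseholderQR<MatrixXd> qr(A);
          MatrixXd R = qr.matrixQR().topRows(A.cols())
                         .triangularView<Eigen::Upper>();
          return R;
      }

      int main() {
          MatrixXd A = MatrixXd::Random(1000, 5);     // tall and skinny
          // Factor the two row blocks independently (in parallel, in practice).
          MatrixXd R1 = rfactor(A.topRows(500));
          MatrixXd R2 = rfactor(A.bottomRows(500));
          MatrixXd stacked(10, 5);
          stacked << R1, R2;                          // stack the small Rs
          MatrixXd R = rfactor(stacked);              // combine: final R of A
          // R matches the direct QR of A up to row signs.
          std::cout << (R.cwiseAbs() - rfactor(A).cwiseAbs()).norm() << "\n";
          return 0;
      }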

  18. Shared prefetching to reduce execution skew in multi-threaded systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Eichenberger, Alexandre E; Gunnels, John A

    Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated based on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.
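
    The distribution idea can be illustrated with OpenMP and the GCC/Clang __builtin_prefetch intrinsic. The sketch below is an illustration of splitting one prefetch stream across threads, not the patented compiler mechanism: every thread reads the same shared array, but each issues prefetches only for its own interleaved slice, so the stream is prefetched once in total rather than once per thread.

      // Cooperative prefetching sketch (illustration, not the patented
      // mechanism): prefetch work for a shared stream is split across threads.
      #include <omp.h>
      #include <vector>
      #include <cstdio>

      int main() {
          const int N = 1 << 22, DIST = 64;       // prefetch distance (elements)
          std::vector<double> a(N, 1.0);
          double sum = 0.0;
          #pragma omp parallel reduction(+:sum)
          {
              const int tid = omp_get_thread_num();
              const int nth = omp_get_num_threads();
              // Every thread walks the same shared stream; prefetches are
              // interleaved so each future element is requested by only one
              // thread (real code would work at cache-line granularity).
              for (int i = 0; i < N; ++i) {
                  const int p = i + DIST;
                  if (p < N && p % nth == tid)
                      __builtin_prefetch(&a[p], /*rw=*/0, /*locality=*/1);
                  sum += a[i];
              }
          }
          // Each thread summed the whole array, so sum = N * #threads.
          std::printf("checksum = %.0f\n", sum);
          return 0;
      }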

  19. Construction of a smart medication dispenser with high degree of scalability and remote manageability.

    PubMed

    Pak, JuGeon; Park, KeeHyun

    2012-01-01

    We propose a smart medication dispenser having a high degree of scalability and remote manageability. We construct the dispenser to have extensible hardware architecture for achieving scalability, and we install an agent program in it for achieving remote manageability. The dispenser operates as follows: when the real-time clock reaches the predetermined medication time and the user presses the dispense button at that time, the predetermined medication is dispensed from the medication dispensing tray (MDT). In the proposed dispenser, the medication for each patient is stored in an MDT. One smart medication dispenser contains mainly one MDT; however, the dispenser can be extended to include more MDTs in order to support multiple users using one dispenser. For remote management, the proposed dispenser transmits the medication status and the system configurations to the monitoring server. In the case of a specific event such as a shortage of medication, memory overload, software error, or non-adherence, the event is transmitted immediately. All these operations are performed automatically without the intervention of patients, through the agent program installed in the dispenser. Results of implementation and verification show that the proposed dispenser operates normally and performs the management operations from the medication monitoring server suitably.
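
    The reporting behaviour described above amounts to a small event protocol. The sketch below is a hypothetical illustration (the event names and transmit function are invented, not taken from the paper): routine status is reported on a periodic cycle, while critical events are pushed to the monitoring server immediately, without user intervention.

      // Hypothetical sketch of the dispenser agent's reporting behaviour.
      #include <cstdio>
      #include <string>

      enum class Event { MedicationShortage, MemoryOverload,
                         SoftwareError, NonAdherence };

      // Stand-in for the agent's link to the medication monitoring server.
      void transmit(const std::string& msg) {
          std::printf("-> server: %s\n", msg.c_str());
      }

      // Critical events bypass the periodic reporting cycle.
      void onEvent(Event e) {
          switch (e) {
              case Event::MedicationShortage: transmit("EVENT: medication shortage"); break;
              case Event::MemoryOverload:     transmit("EVENT: memory overload");     break;
              case Event::SoftwareError:      transmit("EVENT: software error");      break;
              case Event::NonAdherence:       transmit("EVENT: missed dose");         break;
          }
      }

      int main() {
          transmit("periodic: medication status + system configuration");
          onEvent(Event::NonAdherence);   // sent immediately, no patient action
          return 0;
      }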

  20. Approximate l-fold cross-validation with Least Squares SVM and Kernel Ridge Regression

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Edwards, Richard E; Zhang, Hao; Parker, Lynne Edwards

    2013-01-01

    Kernel methods have difficulties scaling to large modern data sets. The scalability issues stem from the computational and memory requirements of working with a large matrix. These requirements have been addressed over the years by using low-rank kernel approximations or by improving the solvers' scalability. However, Least Squares Support Vector Machines (LS-SVM), a popular SVM variant, and Kernel Ridge Regression still have several scalability issues. In particular, the O(n^3) computational complexity for solving a single model, and the overall computational complexity associated with tuning hyperparameters, are still major problems. We address these problems by introducing an O(n log n) approximate l-fold cross-validation method that uses a multi-level circulant matrix to approximate the kernel. In addition, we prove our algorithm's computational complexity and present empirical runtimes on data sets with approximately 1 million data points. We also validate our approximate method's effectiveness at selecting hyperparameters on real-world and standard benchmark data sets. Lastly, we provide experimental results on using a multi-level circulant kernel approximation to solve LS-SVM problems with hyperparameters selected using our method.
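
    The reason a circulant approximation buys O(n log n) costs is that circulant matrices are diagonalized by the discrete Fourier transform, so a matrix-vector product reduces to three FFTs and a pointwise multiply. The sketch below shows that core operation in isolation, using a power-of-two-length radix-2 FFT; it is a generic illustration, not the paper's multi-level construction.

      // Circulant matrix-vector product in O(n log n) via the FFT
      // (generic illustration, not the paper's multi-level scheme).
      #include <complex>
      #include <vector>
      #include <cstdio>
      #include <cmath>
      using cd = std::complex<double>;

      // Recursive radix-2 Cooley-Tukey FFT; n must be a power of two.
      void fft(std::vector<cd>& a, bool inverse) {
          const int n = (int)a.size();
          if (n == 1) return;
          std::vector<cd> even(n / 2), odd(n / 2);
          for (int i = 0; i < n / 2; ++i) { even[i] = a[2*i]; odd[i] = a[2*i+1]; }
          fft(even, inverse); fft(odd, inverse);
          const double ang = (inverse ? -2.0 : 2.0) * M_PI / n;
          for (int k = 0; k < n / 2; ++k) {
              cd t = std::polar(1.0, ang * k) * odd[k];
              a[k] = even[k] + t;
              a[k + n/2] = even[k] - t;
          }
      }

      // y = C x, where C is circulant with first column c.
      std::vector<cd> circulantMatvec(std::vector<cd> c, std::vector<cd> x) {
          const int n = (int)c.size();
          fft(c, false); fft(x, false);
          for (int i = 0; i < n; ++i) c[i] *= x[i];   // pointwise in Fourier space
          fft(c, true);
          for (auto& v : c) v /= (double)n;           // normalize the inverse FFT
          return c;
      }

      int main() {
          std::vector<cd> c = {4, 1, 0, 1};           // first column of C
          std::vector<cd> x = {1, 2, 3, 4};
          for (auto& y : circulantMatvec(c, x))
              std::printf("%.2f ", y.real());          // prints: 10 12 18 20
          std::printf("\n");
          return 0;
      }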

  1. A shared neural ensemble links distinct contextual memories encoded close in time

    NASA Astrophysics Data System (ADS)

    Cai, Denise J.; Aharoni, Daniel; Shuman, Tristan; Shobe, Justin; Biane, Jeremy; Song, Weilin; Wei, Brandon; Veshkini, Michael; La-Vu, Mimi; Lou, Jerry; Flores, Sergio E.; Kim, Isaac; Sano, Yoshitake; Zhou, Miou; Baumgaertel, Karsten; Lavi, Ayal; Kamata, Masakazu; Tuszynski, Mark; Mayford, Mark; Golshani, Peyman; Silva, Alcino J.

    2016-06-01

    Recent studies suggest that a shared neural ensemble may link distinct memories encoded close in time. According to the memory allocation hypothesis, learning triggers a temporary increase in neuronal excitability that biases the representation of a subsequent memory to the neuronal ensemble encoding the first memory, such that recall of one memory increases the likelihood of recalling the other memory. Here we show in mice that the overlap between the hippocampal CA1 ensembles activated by two distinct contexts acquired within a day is higher than when they are separated by a week. Several findings indicate that this overlap of neuronal ensembles links two contextual memories. First, fear paired with one context is transferred to a neutral context when the two contexts are acquired within a day but not across a week. Second, the first memory strengthens the second memory within a day but not across a week. Older mice, known to have lower CA1 excitability, do not show the overlap between ensembles, the transfer of fear between contexts, or the strengthening of the second memory. Finally, in aged mice, increasing cellular excitability and activating a common ensemble of CA1 neurons during two distinct context exposures rescued the deficit in linking memories. Taken together, these findings demonstrate that contextual memories encoded close in time are linked by directing storage into overlapping ensembles. Alteration of these processes by ageing could affect the temporal structure of memories, thus impairing efficient recall of related information.

  2. Factor structure of overall autobiographical memory usage: the directive, self and social functions revisited.

    PubMed

    Rasmussen, Anne S; Habermas, Tilmann

    2011-08-01

    According to theory, autobiographical memory serves three broad functions of overall usage: directive, self, and social. However, there is evidence to suggest that the tripartite model may be better conceptualised in terms of a four-factor model with two social functions. In the present study we examined the two models in Danish and German samples, using the Thinking About Life Experiences Questionnaire (TALE; Bluck, Alea, Habermas, & Rubin, 2005), which measures the overall usage of the three functions generalised across concrete memories. Confirmatory factor analysis supported the four-factor model and rejected the theoretical three-factor model in both samples. The results are discussed in relation to cultural differences in overall autobiographical memory usage as well as sharing versus non-sharing aspects of social remembering.

  3. Automated quantitative muscle biopsy analysis system

    NASA Technical Reports Server (NTRS)

    Castleman, Kenneth R. (Inventor)

    1980-01-01

    An automated system to aid the diagnosis of neuromuscular diseases by producing fiber size histograms utilizing histochemically stained muscle biopsy tissue. Televised images of the microscopic fibers are processed electronically by a multi-microprocessor computer, which isolates, measures, and classifies the fibers and displays the fiber size distribution. The architecture of the multi-microprocessor computer, which is iterated to any required degree of complexity, features a series of individual microprocessors P(n), each receiving data from a shared memory M(n-1) and outputting processed data to a separate shared memory M(n+1), under control of a program stored in dedicated memory M(n).
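
    The chain of processors and shared memories described above is essentially a pipeline. The sketch below mimics that structure in software (an illustration of the architecture only, with hypothetical stage functions standing in for isolate/measure/classify, and indices simplified so that stage P(n) reads buffer M(n) and writes buffer M(n+1)).

      // Sketch of the processor/shared-memory pipeline structure
      // (illustration of the architecture, not the patented hardware).
      #include <vector>
      #include <functional>
      #include <cstdio>

      using Buffer = std::vector<double>;
      using Stage  = std::function<Buffer(const Buffer&)>;

      int main() {
          // Hypothetical stages: isolate -> measure -> classify.
          std::vector<Stage> P = {
              [](const Buffer& b) { Buffer o; for (double v : b) if (v > 0) o.push_back(v); return o; },
              [](const Buffer& b) { Buffer o; for (double v : b) o.push_back(2.0 * v); return o; },
              [](const Buffer& b) { Buffer o; for (double v : b) o.push_back(v > 5 ? 1 : 0); return o; },
          };
          std::vector<Buffer> M(P.size() + 1);         // the shared memories
          M[0] = {3.0, -1.0, 4.0, 1.5};                // M(0): digitized image data
          for (size_t n = 0; n < P.size(); ++n)
              M[n + 1] = P[n](M[n]);                   // P(n): M(n) -> M(n+1)
          for (double v : M.back()) std::printf("%.0f ", v);
          std::printf("\n");
          return 0;
      }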

  4. Experimental evaluation of multiprocessor cache-based error recovery

    NASA Technical Reports Server (NTRS)

    Janssens, Bob; Fuchs, W. K.

    1991-01-01

    Several variations of cache-based checkpointing for rollback error recovery in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, the performance effect of integrating the recovery schemes in the cache coherence protocol are evaluated. The results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead but uncontrollable high variability in the checkpoint interval.

  5. AHPCRC (Army High Performance Computing Research Center) Bulletin. Volume 2, Issue 1

    DTIC Science & Technology

    2010-01-01

    Researchers in AHPCRC Technical Area 4 focus on improving processes for developing scalable, accurate parallel programs that are easily ported from one... Virtual levels in Sequoia represent an abstract memory hierarchy without specifying data transfer mechanisms, giving the...

  6. The FORCE - A highly portable parallel programming language

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.
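
    The two-level layering is easy to mimic with an ordinary preprocessor. The sketch below is not actual FORCE syntax (the FORCE_* names are invented for illustration): a machine-dependent bottom level maps one synchronization concept, the barrier, onto whatever the target provides, and a machine-independent construct is then built purely on top of it.

      // Two-level macro layering sketch (FORCE_* names are hypothetical,
      // not actual FORCE syntax).
      #include <cstdio>

      // Level 1: machine-dependent mapping of the barrier concept.
      #if defined(_OPENMP)
        #define FORCE_BARRIER() _Pragma("omp barrier")
      #else
        #define FORCE_BARRIER() ((void)0)   // serial build: barrier is a no-op
      #endif

      // Level 2: a portable high-level construct built only on level 1:
      // every process finishes phase 1 before any process enters phase 2.
      #define FORCE_BARRICADE(p1, p2) do { p1; FORCE_BARRIER(); p2; } while (0)

      int main() {
          #pragma omp parallel
          FORCE_BARRICADE(
              std::printf("phase 1\n"),
              std::printf("phase 2\n"));
          return 0;
      }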

  7. The FORCE: A highly portable parallel programming language

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.

  8. Hybrid MPI+OpenMP Programming of an Overset CFD Solver and Performance Investigations

    NASA Technical Reports Server (NTRS)

    Djomehri, M. Jahed; Jin, Haoqiang H.; Biegel, Bryan (Technical Monitor)

    2002-01-01

    This report describes a two-level parallelization of a Computational Fluid Dynamics (CFD) solver with multi-zone overset structured grids. The approach is based on a hybrid MPI+OpenMP programming model suitable for shared memory and clusters of shared memory machines. The performance investigations of the hybrid application on an SGI Origin2000 (O2K) machine are reported using medium and large scale test problems.
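
    The two levels map onto a short skeleton: MPI ranks own groups of overset zones and exchange data between nodes, while OpenMP threads parallelize the loops within a zone. The sketch below shows only this generic hybrid pattern, not the CFD solver itself; the dummy loop stands in for the per-zone flow computation.

      // Generic hybrid MPI+OpenMP skeleton: message passing across nodes,
      // shared-memory threading within each node (not the overset solver).
      #include <mpi.h>
      #include <cstdio>

      int main(int argc, char** argv) {
          int provided, rank;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          // Each rank would own a group of overset zones; a dummy sum stands
          // in for the per-zone flow computation, threaded with OpenMP.
          double local = 0.0;
          #pragma omp parallel for reduction(+:local)
          for (int i = 0; i < 1000000; ++i)
              local += 1.0e-6;

          double global = 0.0;   // inter-zone/inter-node coupling step
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          if (rank == 0) std::printf("global residual stand-in: %g\n", global);
          MPI_Finalize();
          return 0;
      }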

  9. Performing a local reduction operation on a parallel computer

    DOEpatents

    Blocksome, Michael A; Faraj, Daniel A

    2013-06-04

    A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.
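
    Stripped of the claim language, the scheme interleaves the copies so that the reduction cores can later each reduce alternating chunks without contending for the same data. A minimal shared-memory sketch with two std::thread workers (an illustration of the interleaving idea, not the patented hardware mechanism):

      // Two "reduction cores" copy their input buffers into shared memory in
      // interleaved chunks, then each locally reduces every other chunk.
      // Illustration only, not the patented mechanism.
      #include <thread>
      #include <vector>
      #include <algorithm>
      #include <numeric>
      #include <cstdio>

      int main() {
          const int N = 1 << 16, CHUNK = 1024, NCHUNKS = N / CHUNK;
          std::vector<double> in0(N, 1.0), in1(N, 2.0);   // per-core input buffers
          std::vector<double> inter(2 * N);               // interleaved shared buffer
          double part[2] = {0.0, 0.0};

          auto copyIn = [&](int core, const std::vector<double>& in) {
              // Chunk k of core c lands at interleaved slot 2*k + c.
              for (int k = 0; k < NCHUNKS; ++k)
                  std::copy_n(in.data() + k * CHUNK, CHUNK,
                              inter.data() + (2 * k + core) * CHUNK);
          };
          std::thread t0(copyIn, 0, std::cref(in0)), t1(copyIn, 1, std::cref(in1));
          t0.join(); t1.join();

          auto reduce = [&](int core) {   // each core sums every other chunk
              for (int s = core; s < 2 * NCHUNKS; s += 2)
                  part[core] += std::accumulate(inter.begin() + s * CHUNK,
                                                inter.begin() + (s + 1) * CHUNK, 0.0);
          };
          std::thread r0(reduce, 0), r1(reduce, 1);
          r0.join(); r1.join();
          std::printf("sum = %.0f (expect %.0f)\n", part[0] + part[1], 3.0 * N);
          return 0;
      }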

  10. Performing a local reduction operation on a parallel computer

    DOEpatents

    Blocksome, Michael A.; Faraj, Daniel A.

    2012-12-11

    A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

  11. Reducing I/O variability using dynamic I/O path characterization in petascale storage systems

    DOE PAGES

    Son, Seung Woo; Sehrish, Saba; Liao, Wei-keng; ...

    2016-11-01

    In petascale systems with a million CPU cores, scalable and consistent I/O performance is becoming increasingly difficult to sustain, mainly because of I/O variability. The I/O variability is caused by concurrently running processes/jobs competing for I/O, or by a RAID rebuild when a disk drive fails. We present a mechanism that stripes across a selected subset of I/O nodes with the lightest workload at runtime to achieve the highest I/O bandwidth available in the system. In this paper, we propose a probing mechanism to enable application-level dynamic file striping to mitigate I/O variability. We also implement the proposed mechanism in the high-level I/O library that enables memory-to-file data layout transformation and allows transparent file partitioning using subfiling. Subfiling is a technique that partitions data into a set of smaller files and manages file access to them, allowing the data to be treated as a single, normal file by users. Here, we demonstrate that our bandwidth probing mechanism can successfully identify temporally slower I/O nodes without noticeable runtime overhead. Experimental results on NERSC's systems also show that our approach isolates I/O variability effectively on shared systems and improves overall collective I/O performance with less variation.

  12. Thermomechanically coupled conduction mode laser welding simulations using smoothed particle hydrodynamics

    NASA Astrophysics Data System (ADS)

    Hu, Haoyue; Eberhard, Peter

    2017-10-01

    Process simulations of conduction mode laser welding are performed using the meshless Lagrangian smoothed particle hydrodynamics (SPH) method. The solid phase is modeled based on the governing equations in thermoelasticity. For the liquid phase, surface tension effects are taken into account to simulate the melt flow in the weld pool, including the Marangoni force caused by a temperature-dependent surface tension gradient. A non-isothermal solid-liquid phase transition with the release or absorption of additional energy, known as the latent heat of fusion, is considered. The major heat transfer through conduction is modeled, whereas heat convection and radiation are neglected. The energy input from the laser beam is modeled as a Gaussian heat source acting on the initial material surface. The developed model is implemented in Pasimodo. Numerical results obtained with the model are presented for laser spot welding and seam welding of aluminum and iron. Changes in process parameters such as welding speed and laser power, and their effects on weld dimensions, are investigated. Furthermore, simulations may be useful to obtain the threshold for deep penetration welding and to assess the overall welding quality. A scalability and performance analysis of the implemented SPH algorithm in Pasimodo is run in a shared memory environment. The analysis reveals the potential of large welding simulations on multi-core machines.
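
    For concreteness, a commonly used Gaussian surface-source form (assumed here; the paper's exact parameterization may differ) is q(d) = (2P / (pi r0^2)) exp(-2 d^2 / r0^2), where P is the absorbed laser power, r0 the beam radius, and d the distance from the beam axis:

      #include <cmath>

      // Gaussian surface heat flux [W/m^2]: absorbed power P [W], beam
      // radius r0 [m], distance d [m] from the beam axis on the surface.
      // A common model form, assumed here rather than taken from the paper.
      double gaussianHeatFlux(double P, double r0, double d) {
          return (2.0 * P / (M_PI * r0 * r0)) * std::exp(-2.0 * d * d / (r0 * r0));
      }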

  13. Progress Towards a Rad-Hydro Code for Modern Computing Architectures LA-UR-10-02825

    NASA Astrophysics Data System (ADS)

    Wohlbier, J. G.; Lowrie, R. B.; Bergen, B.; Calef, M.

    2010-11-01

    We are entering an era of high performance computing where data movement is the overwhelming bottleneck to scalable performance, as opposed to the speed of floating-point operations per processor. All multi-core hardware paradigms, whether heterogeneous or homogeneous, be it the Cell processor, GPGPU, or multi-core x86, share this common trait. In multi-physics applications such as inertial confinement fusion or astrophysics, one may be solving multi-material hydrodynamics with tabular equation of state data lookups, radiation transport, nuclear reactions, and charged particle transport in a single time cycle. The algorithms are intensely data dependent, e.g., EOS, opacity, nuclear data, and multi-core hardware memory restrictions are forcing code developers to rethink code and algorithm design. For the past two years LANL has been funding a small effort referred to as Multi-Physics on Multi-Core to explore ideas for code design as pertaining to inertial confinement fusion and astrophysics applications. The near term goals of this project are to have a multi-material radiation hydrodynamics capability, with tabular equation of state lookups, on cartesian and curvilinear block structured meshes. In the longer term we plan to add fully implicit multi-group radiation diffusion and material heat conduction, and block structured AMR. We will report on our progress to date.

  14. Memory loss

    MedlinePlus

    MedlinePlus encyclopedia entry on memory loss: //medlineplus.gov/ency/article/003257.htm. U.S. National Library of Medicine, Bethesda, MD; National Institutes of Health, U.S. Department of Health and Human Services.

  15. Volumetric Medical Image Coding: An Object-based, Lossy-to-lossless and Fully Scalable Approach

    PubMed Central

    Danyali, Habibiollah; Mertins, Alfred

    2011-01-01

    In this article, an object-based, highly scalable, lossy-to-lossless 3D wavelet coding approach for volumetric medical image data (e.g., magnetic resonance (MR) and computed tomography (CT)) is proposed. The new method, called 3DOBHS-SPIHT, is based on the well-known set partitioning in the hierarchical trees (SPIHT) algorithm and supports both quality and resolution scalability. The 3D input data is grouped into groups of slices (GOS) and each GOS is encoded and decoded as a separate unit. The symmetric tree definition of the original 3DSPIHT is improved by introducing a new asymmetric tree structure. While preserving the compression efficiency, the new tree structure allows for a small size of each GOS, which not only reduces memory consumption during the encoding and decoding processes, but also facilitates more efficient random access to certain segments of slices. To achieve more compression efficiency, the algorithm only encodes the main object of interest in each 3D data set, which can have any arbitrary shape, and ignores the unnecessary background. The experimental results on some MR data sets show the good performance of the 3DOBHS-SPIHT algorithm for multi-resolution lossy-to-lossless coding. The compression efficiency, full scalability, and object-based features of the proposed approach, besides its lossy-to-lossless coding support, make it a very attractive candidate for volumetric medical image information archiving and transmission applications. PMID:22606653

  16. NOA: A Scalable Multi-Parent Clustering Hierarchy for WSNs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cree, Johnathan V.; Delgado-Frias, Jose; Hughes, Michael A.

    2012-08-10

    NOA is a multi-parent, N-tiered, hierarchical clustering algorithm that provides a scalable, robust and reliable solution to autonomous configuration of large-scale wireless sensor networks. The novel clustering hierarchy's inherent benefits can be utilized by in-network data processing techniques to provide equally robust, reliable and scalable in-network data processing solutions capable of reducing the amount of data sent to sinks. Utilizing a multi-parent framework, NOA reduces the cost of network setup when compared to hierarchical beaconing solutions by removing the expense of r-hop broadcasting (r is the radius of the cluster) needed to build the network, and instead passes network topology information among shared children. NOA2, a two-parent clustering hierarchy solution, and NOA3, the three-parent variant, saw up to an 83% and 72% reduction in overhead, respectively, when compared to performing one round of one-parent hierarchical beaconing, as well as 92% and 88% less overhead when compared to one round of two- and three-parent hierarchical beaconing.

  17. We Remember, We Forget: Collaborative Remembering in Older Couples

    ERIC Educational Resources Information Center

    Harris, Celia B.; Keil, Paul G.; Sutton, John; Barnier, Amanda J.; McIlwain, Doris J. F.

    2011-01-01

    Transactive memory theory describes the processes by which benefits for memory can occur when remembering is shared in dyads or groups. In contrast, cognitive psychology experiments demonstrate that social influences on memory disrupt and inhibit individual recall. However, most research in cognitive psychology has focused on groups of strangers…

  18. 76 FR 12821 - 150th Anniversary of the Inauguration of Abraham Lincoln

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-03-09

    ... together by shared memories and common hopes. As we observe the 150th anniversary of his Inauguration, we... his memory enabled America to move beyond a young collection of States to become a free and unified... memory and uphold the principles he so nobly advanced.

  19. Expert Systems on Multiprocessor Architectures. Volume 2. Technical Reports

    DTIC Science & Technology

    1991-06-01

    Report RC 12936 (#58037). IBM T. J. Watson Research Center. July 1987. Alan Jay Smith. Cache memories. Computing Surveys, 14(3): 473-530...basic-shared is an instrument for a shared memory design. The components panels are processor-qload-scrolling-bar-panel, memory-qload-scrolling-bar-panel

  20. Space Situational Awareness Data Processing Scalability Utilizing Google Cloud Services

    NASA Astrophysics Data System (ADS)

    Greenly, D.; Duncan, M.; Wysack, J.; Flores, F.

    Space Situational Awareness (SSA) is a fundamental and critical component of current space operations. The term SSA encompasses the awareness, understanding and predictability of all objects in space. As the population of orbital space objects and debris increases, the number of collision avoidance maneuvers grows, prompting the need for accurate and timely processing. The SSA mission continually evolves toward near real-time assessment and analysis, demanding higher processing capabilities. By conventional methods, meeting these demands requires the integration of new hardware to keep pace with the growing complexity of maneuver planning algorithms. SpaceNav has implemented a highly scalable architecture that tracks satellites and debris by utilizing powerful virtual machines on the Google Cloud Platform. SpaceNav algorithms for processing CDMs outpace conventional means. A robust processing environment for tracking data, collision avoidance maneuvers and various other aspects of SSA can be created and deleted on demand. We discuss the migration of SpaceNav tools and algorithms into the Google Cloud Platform and the challenges involved, and share how and why certain cloud products were used, as well as the integration techniques that were implemented. Key items presented are: 1. Scientific algorithms and SpaceNav tools integrated into a scalable architecture: a) Maneuver Planning; b) Parallel Processing; c) Monte Carlo Simulations; d) Optimization Algorithms; e) SW Application Development/Integration into the Google Cloud Platform. 2. Compute Engine Processing: a) Application Engine Automated Processing; b) Performance Testing and Performance Scalability; c) Cloud MySQL Databases and Database Scalability; d) Cloud Data Storage; e) Redundancy and Availability.

  1. Blanket Gate Would Address Blocks Of Memory

    NASA Technical Reports Server (NTRS)

    Lambe, John; Moopenn, Alexander; Thakoor, Anilkumar P.

    1988-01-01

    Circuit-chip area used more efficiently. Proposed gate structure selectively allows and restricts access to blocks of memory in electronic neural-type network. By breaking memory into independent blocks, gate greatly simplifies problem of reading from and writing to memory. Since blocks not used simultaneously, share operational amplifiers that prompt and read information stored in memory cells. Fewer operational amplifiers needed, and chip area occupied reduced correspondingly. Cost per bit drops as result.

  2. The potential of multi-port optical memories in digital computing

    NASA Technical Reports Server (NTRS)

    Alford, C. O.; Gaylord, T. K.

    1975-01-01

    A high-capacity memory with a relatively high data transfer rate and multi-port simultaneous access capability may serve as the basis for new computer architectures. The implementation of a multi-port optical memory is discussed. Several computer structures are presented that might profitably use such a memory. These structures include (1) a simultaneous record access system, (2) a simultaneously shared memory computer system, and (3) a parallel digital processing structure.

  3. Ferroelectric memory based on molybdenum disulfide and ferroelectric hafnium oxide

    NASA Astrophysics Data System (ADS)

    Yap, Wui Chung; Jiang, Hao; Xia, Qiangfei; Zhu, Wenjuan

    Recently, ferroelectric hafnium oxide (HfO2) was discovered as a new type of ferroelectric material with the advantages of high coercive field, excellent scalability (down to 2.5 nm), and good compatibility with CMOS processing. In this work, we demonstrate, for the first time, 2D ferroelectric memories with molybdenum disulfide (MoS2) as the channel material and aluminum-doped HfO2 as the ferroelectric gate dielectric. A 16 nm thick layer of HfO2, doped with 5.26% aluminum, was deposited via atomic layer deposition (ALD), then subjected to rapid thermal annealing (RTA) at 1000 °C, and the polarization-voltage characteristics of the resulting metal-ferroelectric-metal (MFM) capacitors were measured, showing a remnant polarization of 0.6 μC/cm2. Ferroelectric memories with embedded ferroelectric hafnium oxide stacks and monolayer MoS2 were fabricated. The transfer characteristics after program and erase pulses revealed a clear ferroelectric memory window. In addition, the endurance (up to 10,000 cycles) of the devices was tested, and effects associated with ferroelectric materials, such as the wake-up effect and polarization fatigue, were observed. This research can potentially lead to advances in the use of 2D materials for low-power logic and memory applications.

  4. Local wavelet transform: a cost-efficient custom processor for space image compression

    NASA Astrophysics Data System (ADS)

    Masschelein, Bart; Bormans, Jan G.; Lafruit, Gauthier

    2002-11-01

    Thanks to its intrinsic scalability features, the wavelet transform has become increasingly popular as decorrelator in image compression applications. Throughput, memory requirements and complexity are important parameters when developing hardware image compression modules. An implementation of the classical, global wavelet transform requires large memory sizes and implies a large latency between the availability of the input image and the production of minimal data entities for entropy coding. Image tiling methods, as proposed by JPEG2000, reduce the memory sizes and the latency, but inevitably introduce image artefacts. The Local Wavelet Transform (LWT), presented in this paper, is a low-complexity wavelet transform architecture using a block-based processing that results in the same transformed images as those obtained by the global wavelet transform. The architecture minimizes the processing latency with a limited amount of memory. Moreover, as the LWT is an instruction-based custom processor, it can be programmed for specific tasks, such as push-broom processing of infinite-length satellite images. The features of the LWT make it appropriate for use in space image compression, where high throughput, low memory sizes, low complexity, low power and push-broom processing are important requirements.
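
    The decorrelating transform itself is small; what the LWT changes is the scheduling and buffering around it. As a reference point, the sketch below applies one level of the simplest wavelet (Haar-style averages and differences) to a block of samples; it illustrates the transform being scheduled, not the LWT processor itself.

      // One level of a Haar-style wavelet on a block of samples:
      // output = [approximation coefficients | detail coefficients].
      // Reference illustration only, not the LWT architecture.
      #include <vector>
      #include <cstdio>

      std::vector<double> haarLevel(const std::vector<double>& x) {
          const size_t h = x.size() / 2;          // assumes even length
          std::vector<double> out(2 * h);
          for (size_t i = 0; i < h; ++i) {
              out[i]     = (x[2*i] + x[2*i + 1]) / 2.0;   // low-pass: average
              out[h + i] = (x[2*i] - x[2*i + 1]) / 2.0;   // high-pass: difference
          }
          return out;
      }

      int main() {
          std::vector<double> block = {10, 12, 14, 14, 20, 24, 24, 16};
          for (double c : haarLevel(block)) std::printf("%.1f ", c);
          std::printf("\n");   // smooth trend first, then the details
          return 0;
      }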

  5. Sharing knowledge with the public during a crisis: NASA's public portal

    NASA Technical Reports Server (NTRS)

    Holm, Jeanne

    2003-01-01

    This case study looks at integrating web governance policies and procedures, migrating to a single content management solution, and integrating best-of-breed technology with high-impact, interactive components. In particular, this case study is notable for the dynamic scalability of the application in meeting the needs of an organization on the front lines during a crisis.

  6. GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sengupta, Dipanjan; Agarwal, Kapil; Song, Shuaiwen

    2015-09-30

    Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device’s internal memory capacity. GraphReduce adopts a combination of both edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and the device.

  7. Multiple switching modes and multiple level states in memristive devices

    NASA Astrophysics Data System (ADS)

    Miao, Feng; Yang, J. Joshua; Borghetti, Julien; Strachan, John Paul; Zhang, M.-X.; Goldfarb, Ilan; Medeiros-Ribeiro, Gilberto; Williams, R. Stanley

    2011-03-01

    As one of the most promising technologies for next-generation non-volatile memory, metal-oxide-based memristive devices have demonstrated great advantages in scalability, operating speed and power consumption. Here we report the observation of multiple switching modes and multiple level states in different memristive systems. The multiple switching modes can be obtained by limiting the current during electroforming, and related transport behaviors, including ionic and electronic motions, are characterized. These observations can be rationalized by a model of two effective switching layers adjacent to the bottom and top electrodes. Multiple level states, corresponding to different compositions of the conducting channel, will also be discussed in the context of multiple-level storage for high-density, non-volatile memory applications.

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, Chao; Pouransari, Hadi; Rajamanickam, Sivasankaran

    We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We also provide various numerical results to demonstrate the versatility and scalability of the parallel algorithm.

  9. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Perumalla, Kalyan S.; Yoginath, Srikanth B.

    Problems such as fault tolerance and scalable synchronization can be efficiently solved using reversibility of applications. Making applications reversible by relying on computation rather than on memory is ideal for large scale parallel computing, especially for the next generation of supercomputers in which memory is expensive in terms of latency, energy, and price. In this direction, a case study is presented here in reversing a computational core, namely, Basic Linear Algebra Subprograms, which is widely used in scientific applications. A new Reversible BLAS (RBLAS) library interface has been designed, and a prototype has been implemented with two modes: (1) a memory-mode in which reversibility is obtained by checkpointing to memory in forward and restoring from memory in reverse, and (2) a computational-mode in which nothing is saved in the forward, but restoration is done entirely via inverse computation in reverse. The article is focused on detailed performance benchmarking to evaluate the runtime dynamics and performance effects, comparing reversible computation with checkpointing on both traditional CPU platforms and recent GPU accelerator platforms. For BLAS Level-1 subprograms, data indicates over an order of magnitude better speed of reversible computation compared to checkpointing. For BLAS Level-2 and Level-3, a more complex tradeoff is observed between reversible computation and checkpointing, depending on computational and memory complexities of the subprograms.
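
    For a Level-1 subprogram the two modes are easy to see side by side. The sketch below is a hypothetical illustration, not the RBLAS interface: it reverses AXPY (y <- a*x + y) first by checkpointing y to memory and restoring it, then by the inverse update y <- y - a*x, which stores nothing.

      // Reversing AXPY (y = a*x + y) two ways: checkpointing vs. inverse
      // computation. Names are illustrative, not the RBLAS interface.
      #include <vector>
      #include <cstdio>

      void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
          for (size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
      }

      int main() {
          const double a = 2.0;
          std::vector<double> x = {1, 2, 3}, y = {4, 5, 6};

          // Mode 1: memory -- save y, run forward, restore from the copy.
          std::vector<double> checkpoint = y;     // O(n) extra memory
          axpy(a, x, y);
          y = checkpoint;                         // reverse by restoring

          // Mode 2: computation -- run forward, reverse by the inverse update.
          // (Exact for these values; floating-point inversion can differ in
          // the last bits in general.)
          axpy(a, x, y);
          axpy(-a, x, y);                         // no checkpoint needed

          std::printf("y restored: %.0f %.0f %.0f\n", y[0], y[1], y[2]);
          return 0;
      }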

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Katti, Amogh; Di Fatta, Giuseppe; Naughton III, Thomas J

    Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.

  11. Methodology for fast detection of false sharing in threaded scientific codes

    DOEpatents

    Chung, I-Hsin; Cong, Guojing; Murata, Hiroki; Negishi, Yasushi; Wen, Hui-Fang

    2014-11-25

    A profiling tool identifies a code region with a false sharing potential. A static analysis tool classifies variables and arrays in the identified code region. A mapping detection library correlates memory access instructions in the identified code region with variables and arrays in the identified code region while a processor is running the identified code region. The mapping detection library identifies one or more instructions at risk, in the identified code region, which are subject to an analysis by a false sharing detection library. A false sharing detection library performs a run-time analysis of the one or more instructions at risk while the processor is re-running the identified code region. The false sharing detection library determines, based on the performed run-time analysis, whether two different portions of the cache memory line are accessed by the generated binary code.
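
    The pattern such a tool hunts for can be reproduced in a few lines. In the sketch below (a generic demonstration of false sharing, not the patented detector), two threads update distinct counters: unpadded, both counters live in one cache line and the line ping-pongs between cores; alignas(64) padding places them on separate lines with no change to program logic.

      // False sharing in miniature (generic demo, not the patented detector):
      // distinct variables, one cache line.
      #include <thread>
      #include <cstdio>

      struct Unpadded { long a; long b; };                  // same cache line
      struct Padded   { alignas(64) long a; alignas(64) long b; };

      template <typename Counters>
      void hammer(Counters& c) {
          auto inc = [](long& v) { for (int i = 0; i < 50000000; ++i) ++v; };
          std::thread t1(inc, std::ref(c.a)), t2(inc, std::ref(c.b));
          t1.join(); t2.join();
      }

      int main() {
          Unpadded u{}; Padded p{};
          hammer(u);   // threads invalidate each other's cache line constantly
          hammer(p);   // same work, no line ping-pong: typically several x faster
          std::printf("u: %ld %ld, p: %ld %ld\n", u.a, u.b, p.a, p.b);
          return 0;
      }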

  12. Enhancing Shared Decision Making Through Carefully Designed Interventions That Target Patient And Provider Behavior.

    PubMed

    Tai-Seale, Ming; Elwyn, Glyn; Wilson, Caroline J; Stults, Cheryl; Dillon, Ellis C; Li, Martina; Chuang, Judith; Meehan, Amy; Frosch, Dominick L

    2016-04-01

    Patient-provider communication and shared decision making are essential for primary care delivery and are vital contributors to patient experience and health outcomes. To alleviate communication shortfalls, we designed a novel, multidimensional intervention aimed at nudging both patients and primary care providers to communicate more openly. The intervention was tested against an existing intervention, which focused mainly on changing patients' behaviors, in four primary care clinics involving 26 primary care providers and 300 patients. Study results suggest that compared to usual care, both the novel and existing interventions were associated with better patient reports of how well primary care providers engaged them in shared decision making. Future research should build on the work in this pilot to rigorously examine the comparative effectiveness and scalability of these interventions to improve shared decision making at the point of care. Project HOPE—The People-to-People Health Foundation, Inc.

  13. A Generic Ground Framework for Image Expertise Centres and Small-Sized Production Centres

    NASA Astrophysics Data System (ADS)

    Sellé, A.

    2009-05-01

    Initiated by the Pleiades Earth Observation Program, CNES (the French Space Agency) has developed a generic collaborative framework for its image quality centre, highly customisable for any upcoming expertise centre. This collaborative framework has been designed to be used by a group of experts or scientists who want to share data and processings and manage interfaces with external entities. Its flexible and scalable architecture complies with the core requirements: defining a user data model with no impact on the software (generic data access), integrating user processings with a GUI builder and built-in APIs, and offering a scalable architecture to fit any performance requirement and accompany growing projects. CNES has granted licences to two software companies that will be able to redistribute this framework to any customer.

  14. A kilobyte rewritable atomic memory

    NASA Astrophysics Data System (ADS)

    Kalff, F. E.; Rebergen, M. P.; Fahrenfort, E.; Girovsky, J.; Toskovic, R.; Lado, J. L.; Fernández-Rossier, J.; Otte, A. F.

    2016-11-01

    The advent of devices based on single dopants, such as the single-atom transistor, the single-spin magnetometer and the single-atom memory, has motivated the quest for strategies that permit the control of matter with atomic precision. Manipulation of individual atoms by low-temperature scanning tunnelling microscopy provides ways to store data in atoms, encoded either into their charge state, magnetization state or lattice position. A clear challenge now is the controlled integration of these individual functional atoms into extended, scalable atomic circuits. Here, we present a robust digital atomic-scale memory of up to 1 kilobyte (8,000 bits) using an array of individual surface vacancies in a chlorine-terminated Cu(100) surface. The memory can be read and rewritten automatically by means of atomic-scale markers and offers an areal density of 502 terabits per square inch, outperforming state-of-the-art hard disk drives by three orders of magnitude. Furthermore, the chlorine vacancies are found to be stable at temperatures up to 77 K, offering the potential for expanding large-scale atomic assembly towards ambient conditions.

  15. A fast, high-endurance and scalable non-volatile memory device made from asymmetric Ta2O5-x/TaO2-x bilayer structures

    NASA Astrophysics Data System (ADS)

    Lee, Myoung-Jae; Lee, Chang Bum; Lee, Dongsoo; Lee, Seung Ryul; Chang, Man; Hur, Ji Hyun; Kim, Young-Bae; Kim, Chang-Jung; Seo, David H.; Seo, Sunae; Chung, U.-In; Yoo, In-Kyeong; Kim, Kinam

    2011-08-01

    Numerous candidates attempting to replace Si-based flash memory have failed for a variety of reasons over the years. Oxide-based resistance memory and the related memristor have succeeded in surpassing the specifications for a number of device requirements. However, a material or device structure that satisfies high-density, switching-speed, endurance, retention and most importantly power-consumption criteria has yet to be announced. In this work we demonstrate a TaOx-based asymmetric passive switching device with which we were able to localize resistance switching and satisfy all aforementioned requirements. In particular, the reduction of switching current drastically reduces power consumption and results in extreme cycling endurances of over 10^12. Along with the 10 ns switching times, this allows for possible applications to the working-memory space as well. Furthermore, by combining two such devices each with an intrinsic Schottky barrier we eliminate any need for a discrete transistor or diode in solving issues of stray leakage current paths in high-density crossbar arrays.

  16. Efficient calculation of open quantum system dynamics and time-resolved spectroscopy with distributed memory HEOM (DM-HEOM).

    PubMed

    Kramer, Tobias; Noack, Matthias; Reinefeld, Alexander; Rodríguez, Mirta; Zelinskyy, Yaroslav

    2018-06-11

    Time- and frequency-resolved optical signals provide insights into the properties of light-harvesting molecular complexes, including excitation energies, dipole strengths and orientations, as well as in the exciton energy flow through the complex. The hierarchical equations of motion (HEOM) provide a unifying theory, which allows one to study the combined effects of system-environment dissipation and non-Markovian memory without making restrictive assumptions about weak or strong couplings or separability of vibrational and electronic degrees of freedom. With increasing system size the exact solution of the open quantum system dynamics requires memory and compute resources beyond a single compute node. To overcome this barrier, we developed a scalable variant of HEOM. Our distributed memory HEOM, DM-HEOM, is a universal tool for open quantum system dynamics. It is used to accurately compute all experimentally accessible time- and frequency-resolved processes in light-harvesting molecular complexes with arbitrary system-environment couplings for a wide range of temperatures and complex sizes. © 2018 Wiley Periodicals, Inc.

  17. An Adaptive Insertion and Promotion Policy for Partitioned Shared Caches

    NASA Astrophysics Data System (ADS)

    Mahrom, Norfadila; Liebelt, Michael; Raof, Rafikha Aliana A.; Daud, Shuhaizar; Hafizah Ghazali, Nur

    2018-03-01

    Cache replacement policies in chip multiprocessors (CMP) have been investigated extensively and proven able to enhance shared cache management. However, competition among multiple processors executing different threads that require simultaneous access to a shared memory may cause cache contention and memory coherence problems on the chip. These issues also stem from drawbacks of the commonly used Least Recently Used (LRU) policy employed in multiprocessor systems, under which cache lines can reside in the cache longer than required. In image-processing analysis of, for example, extrapulmonary tuberculosis (TB), an accurate diagnosis of the tissue specimen is required; a fast and reliable shared memory management system is therefore needed to execute algorithms that process vast amounts of specimen images. In this paper, the effects of the cache replacement policy in a partitioned shared cache are investigated. The goal is to quantify whether better performance can be achieved by using less complex replacement strategies. This paper proposes a Middle Insertion 2 Positions Promotion (MI2PP) policy to eliminate cache misses that could adversely affect the access patterns and the throughput of the processors in the system. The policy employs a static predefined insertion point, near-distance promotion, and the concept of ownership in the eviction policy to effectively reduce cache thrashing and to avoid resource stealing among the processors.
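
    The flavour of such insertion/promotion policies can be simulated on a single cache set. The sketch below is a simplified stand-in for MI2PP (the exact insertion point and promotion rule are assumptions): new lines enter at the middle of the recency stack and hits are promoted one position at a time, so on a cyclic scan slightly larger than the set, classic LRU thrashes to zero hits while the modified policy keeps part of the working set resident.

      // One cache set under classic LRU vs. a middle-insertion/near-promotion
      // policy (a simplified sketch of an MI2PP-style policy, details assumed).
      #include <vector>
      #include <algorithm>
      #include <cstdio>

      struct CacheSet {
          std::vector<long> stack;    // front = most protected, back = evicted
          size_t ways; bool middleInsert, nearPromote;
          bool access(long tag) {
              auto it = std::find(stack.begin(), stack.end(), tag);
              if (it != stack.end()) {                       // hit
                  if (nearPromote) {                         // promote one slot
                      if (it != stack.begin()) std::iter_swap(it, it - 1);
                  } else {                                   // classic LRU: to MRU
                      stack.erase(it); stack.insert(stack.begin(), tag);
                  }
                  return true;
              }
              if (stack.size() == ways) stack.pop_back();    // evict from the back
              stack.insert(stack.begin() + (middleInsert ? stack.size() / 2 : 0), tag);
              return false;
          }
      };

      int main() {
          CacheSet lru{{}, 8, false, false}, mid{{}, 8, true, true};
          int hitsL = 0, hitsM = 0;
          for (int pass = 0; pass < 100; ++pass)
              for (long tag = 0; tag < 10; ++tag) {          // working set > ways
                  hitsL += lru.access(tag);
                  hitsM += mid.access(tag);
              }
          std::printf("LRU hits: %d, middle-insert hits: %d\n", hitsL, hitsM);
          return 0;
      }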

  18. Hierarchical Traces for Reduced NSM Memory Requirements

    NASA Astrophysics Data System (ADS)

    Dahl, Torbjørn S.

    This paper presents work on using hierarchical long-term memory to reduce the memory requirements of nearest sequence memory (NSM) learning, a previously published, instance-based reinforcement learning algorithm. A hierarchical memory representation reduces the memory requirements by allowing traces to share common sub-sequences. We present moderated mechanisms for estimating discounted future rewards and for dealing with hidden state using hierarchical memory. We also present an experimental analysis of how the sub-sequence length affects the memory compression achieved and show that the reduced memory requirements do not affect the speed of learning. Finally, we analyse and discuss the persistence of the sub-sequences independent of specific trace instances.
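
    The sharing of common sub-sequences can be sketched with a trie: each trace is a path from the root, and traces with a common prefix reuse the same nodes. The structure below is our illustration of the idea, not the paper's algorithm; the compression count simply compares trie nodes against flat per-trace storage.

    ```python
    # Sketch: storing traces in a trie so they share common sub-sequences
    # instead of each keeping a full copy.

    class TrieNode:
        def __init__(self):
            self.children = {}   # observation -> TrieNode
            self.rewards = []    # discounted rewards recorded at this point

    class TraceTrie:
        def __init__(self):
            self.root = TrieNode()
            self.nodes = 1

        def add_trace(self, observations, reward):
            node = self.root
            for obs in observations:
                if obs not in node.children:
                    node.children[obs] = TrieNode()
                    self.nodes += 1
                node = node.children[obs]
            node.rewards.append(reward)

    trie = TraceTrie()
    trie.add_trace(("a", "b", "c"), 1.0)
    trie.add_trace(("a", "b", "d"), 0.5)   # shares the "a", "b" prefix
    flat_cost = 3 + 3                       # states stored by flat traces
    print("trie nodes:", trie.nodes - 1, "vs flat:", flat_cost)
    ```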

  19. The Contribution of Working Memory to Fluid Reasoning: Capacity, Control, or Both?

    ERIC Educational Resources Information Center

    Chuderski, Adam; Necka, Edward

    2012-01-01

    Fluid reasoning shares a large part of its variance with working memory capacity (WMC). The literature on working memory (WM) suggests that the capacity of the focus of attention responsible for simultaneous maintenance and integration of information within WM, as well as the effectiveness of executive control exerted over WM, determines…

  20. Feature-Based Memory-Driven Attentional Capture: Visual Working Memory Content Affects Visual Attention

    ERIC Educational Resources Information Center

    Olivers, Christian N. L.; Meijer, Frank; Theeuwes, Jan

    2006-01-01

    In 7 experiments, the authors explored whether visual attention (the ability to select relevant visual information) and visual working memory (the ability to retain relevant visual information) share the same content representations. The presence of singleton distractors interfered more strongly with a visual search task when it was accompanied by…

  1. Time-Related Decay or Interference-Based Forgetting in Working Memory?

    ERIC Educational Resources Information Center

    Portrat, Sophie; Barrouillet, Pierre; Camos, Valerie

    2008-01-01

    The time-based resource-sharing model of working memory assumes that memory traces suffer from a time-related decay when attention is occupied by concurrent activities. Using complex continuous span tasks in which temporal parameters are carefully controlled, P. Barrouillet, S. Bernardin, S. Portrat, E. Vergauwe, & V. Camos (2007) recently…

  2. Developmental Change in Working Memory Strategies: From Passive Maintenance to Active Refreshing

    ERIC Educational Resources Information Center

    Camos, Valerie; Barrouillet, Pierre

    2011-01-01

    Change in strategies is often mentioned as a source of memory development. However, though performance in working memory tasks steadily improves during childhood, theories differ in linking this development to strategy changes. Whereas some theories, such as the time-based resource-sharing model, invoke the age-related increase in use and…

  3. Cache write generate for parallel image processing on shared memory architectures.

    PubMed

    Wittenbrink, C M; Somani, A K; Chen, C H

    1996-01-01

    We investigate cache write generate, a new cache write mode we have invented. We demonstrate that for parallel image processing applications, the new mode improves main memory bandwidth, CPU efficiency, cache hits, and cache latency. We use register-level simulations validated by the UW-Proteus system. Many memory, cache, and processor configurations are evaluated.

  4. Mnemonic convergence in social networks: The emergent properties of cognition at a collective level

    PubMed Central

    Coman, Alin; Momennejad, Ida; Drach, Rae D.; Geana, Andra

    2016-01-01

    The development of shared memories, beliefs, and norms is a fundamental characteristic of human communities. These emergent outcomes are thought to occur owing to a dynamic system of information sharing and memory updating, which fundamentally depends on communication. Here we report results on the formation of collective memories in laboratory-created communities. We manipulated conversational network structure in a series of real-time, computer-mediated interactions in fourteen 10-member communities. The results show that mnemonic convergence, measured as the degree of overlap among community members’ memories, is influenced by both individual-level information-processing phenomena and by the conversational social network structure created during conversational recall. By studying laboratory-created social networks, we show how large-scale social phenomena (i.e., collective memory) can emerge out of microlevel local dynamics (i.e., mnemonic reinforcement and suppression effects). The social-interactionist approach proposed herein points to optimal strategies for spreading information in social networks and provides a framework for measuring and forging collective memories in communities of individuals. PMID:27357678

  5. Virtual Machine-level Software Transactional Memory: Principles, Techniques, and Implementation

    DTIC Science & Technology

    2015-08-13

    [Indexing excerpt; recoverable content: the VM-level STM (ByteSTM, with TL2 and NOrec algorithms) is benchmarked against non-VM STMs running the same algorithms outside the VM, including Deuce, ObjectFabric, Multiverse, and JVSTM. Figure residue: Fig. 1, throughput for a linked-list benchmark at 20% and 80% writes; higher is better.]

  6. PAUSE: Predictive Analytics Using SPARQL-Endpoints

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sukumar, Sreenivas R; Ainsworth, Keela; Bond, Nathaniel

    2014-07-11

    This invention relates to the medical industry and more specifically to methods of predicting risks. With the impetus towards personalized and evidence-based medicine, the need for a framework to analyze/interpret quantitative measurements (blood work, toxicology, etc.) with qualitative descriptions (specialist reports after reading images, bio-medical knowledgebase, etc.) to predict diagnostic risks is fast emerging. We describe a software solution that leverages hardware for scalable in-memory analytics and applies next-generation semantic query tools on medical data.
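
    The abstract does not expose PAUSE's actual endpoint or vocabulary, but the core idea, risk prediction as a semantic query over linked medical facts, can be sketched with rdflib. The schema, URIs, and data below are invented for illustration.

    ```python
    # Sketch of the PAUSE idea: qualitative medical facts in a triple store,
    # queried through SPARQL to surface risk indicators. All URIs and data
    # here are invented.

    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/med#")
    g = Graph()
    g.add((EX.patient1, RDF.type, EX.Patient))
    g.add((EX.patient1, EX.hasFinding, EX.HighGlucose))
    g.add((EX.HighGlucose, EX.indicatesRisk, EX.Diabetes))

    # Follow finding -> risk links, the kind of join a risk predictor needs.
    q = """
    PREFIX ex: <http://example.org/med#>
    SELECT ?patient ?risk WHERE {
        ?patient ex:hasFinding ?f .
        ?f ex:indicatesRisk ?risk .
    }
    """
    for patient, risk in g.query(q):
        print(patient, "=>", risk)
    ```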

  7. Vascular system modeling in parallel environment - distributed and shared memory approaches

    PubMed Central

    Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

    2011-01-01

    The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891

  8. Embedded ensemble propagation for improving performance, portability, and scalability of uncertainty quantification on emerging computational architectures

    DOE PAGES

    Phipps, Eric T.; D'Elia, Marta; Edwards, Harold C.; ...

    2017-04-18

    In this study, quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan).
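
    A shape-level analogue of embedded ensemble propagation can be written in numpy by carrying an ensemble axis through a PDE sweep, so all samples share index arithmetic and memory traffic; the paper achieves this in C++ via templates within Trilinos. The grid size, time step, and diffusion model below are arbitrary toy choices.

    ```python
    # Numpy analogue of embedded ensemble propagation: an ensemble axis is
    # carried through the whole computation so samples share sparsity
    # structure, index arithmetic, and memory traffic.

    import numpy as np

    n, ens = 1000, 32                       # grid points, samples per ensemble
    rng = np.random.default_rng(0)
    kappa = rng.uniform(0.5, 1.5, ens)      # uncertain diffusion coefficients

    u = np.zeros((n, ens))
    u[n // 2, :] = 1.0                      # same initial spike for all samples
    dt, dx = 2e-7, 1.0 / n

    for _ in range(200):
        lap = (np.roll(u, 1, axis=0) - 2 * u + np.roll(u, -1, axis=0)) / dx**2
        u += dt * kappa * lap               # one sweep advances all 32 samples

    print("mean center value:", u[n // 2].mean())
    ```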

  9. Optimal and Scalable Caching for 5G Using Reinforcement Learning of Space-Time Popularities

    NASA Astrophysics Data System (ADS)

    Sadeghi, Alireza; Sheikholeslami, Fatemeh; Giannakis, Georgios B.

    2018-02-01

    Small base stations (SBs) equipped with caching units have the potential to handle the unprecedented demand growth in heterogeneous networks. Through low-rate backhaul connections with the backbone, SBs can prefetch popular files during off-peak traffic hours, and serve them to the edge at peak periods. To prefetch intelligently, each SB must learn what and when to cache, while taking into account SB memory limitations, the massive number of available contents, the unknown popularity profiles, as well as the space-time popularity dynamics of user file requests. In this work, local and global Markov processes model user requests, and a reinforcement learning (RL) framework is put forth for finding the optimal caching policy when the transition probabilities involved are unknown. Joint consideration of global and local popularity demands along with cache-refreshing costs allows for a simple, yet practical asynchronous caching approach. The novel RL-based caching relies on a Q-learning algorithm to implement the optimal policy in an online fashion, thus enabling the cache control unit at the SB to learn, track, and possibly adapt to the underlying dynamics. To endow the algorithm with scalability, a linear function approximation of the proposed Q-learning scheme is introduced, offering faster convergence as well as reduced complexity and memory requirements. Numerical tests corroborate the merits of the proposed approach in various realistic settings.
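
    A toy version of the approach is easy to sketch: a tabular Q-learner chooses which single file to prefetch given a coarse popularity state. The state and action spaces, reward values, and request model below are simplified stand-ins for the paper's Markov-modulated popularity model, not its actual formulation.

    ```python
    # Toy RL-based caching: a tabular Q-learner decides which file to
    # prefetch given a coarse (local/global) popularity state.

    import random

    n_files, n_states = 4, 3
    Q = {(s, a): 0.0 for s in range(n_states) for a in range(n_files)}
    alpha, gamma, eps = 0.1, 0.9, 0.1

    def step(state, cached):
        request = random.choices(range(n_files), weights=[state + 1, 1, 1, 1])[0]
        reward = 1.0 if request == cached else -0.1   # hit vs backhaul fetch
        return reward, random.randrange(n_states)     # popularity drifts

    state = 0
    for _ in range(5000):
        a = (random.randrange(n_files) if random.random() < eps
             else max(range(n_files), key=lambda x: Q[(state, x)]))
        r, nxt = step(state, a)
        best = max(Q[(nxt, x)] for x in range(n_files))
        Q[(state, a)] += alpha * (r + gamma * best - Q[(state, a)])
        state = nxt

    # learned policy: which file to cache in each popularity state
    print({s: max(range(n_files), key=lambda a: Q[(s, a)]) for s in range(n_states)})
    ```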

  10. Reliable file sharing in distributed operating system using web RTC

    NASA Astrophysics Data System (ADS)

    Dukiya, Rajesh

    2017-12-01

    Since the evolution of distributed operating systems, the distributed file system has become an important part of the operating system. P2P is a reliable way to share files in a distributed operating system. Introduced in 1999, it later became a topic of high research interest. A peer-to-peer network is a type of network in which peers share the network workload and other related tasks. A P2P network can also be a temporary connection, where a group of computers connected by USB (Universal Serial Bus) ports transfer files or enable disk sharing. Currently, P2P requires a special network designed in a P2P way. Nowadays, browsers have a large influence on our lives. In this project we study file-sharing mechanisms for distributed operating systems in web browsers, where we try to find performance bottlenecks; our research aims to improve the performance and scalability of file sharing in distributed file systems. Additionally, we discuss the scope of WebTorrent file sharing and free-riding in peer-to-peer networks.

  11. Investigation of High-k Dielectrics and Metal Gate Electrodes for Non-volatile Memory Applications

    NASA Astrophysics Data System (ADS)

    Jayanti, Srikant

    Due to the increasing demand for non-volatile flash memories in portable electronics, device structures need to be scaled down drastically. However, the scalability of traditional floating gate structures beyond the 20 nm NAND flash technology node is uncertain. In this regard, the use of metal gates and high-k dielectrics as the gate and interpoly dielectrics (IPDs), respectively, seems a promising substitute in order to continue flash scaling beyond 20 nm. Furthermore, novel memory structures to overcome the scaling challenges need to be explored. Through this work, the use of high-k dielectrics as IPDs in a memory structure has been studied. For this purpose, IPD process optimization and barrier engineering were explored to determine and improve the memory performance. Specifically, the concept of high-k / low-k barrier engineering was studied, corroborated by simulations. In addition, a novel memory structure comprising a continuous metal floating gate was investigated in combination with high-k blocking oxides. Integration of thin metal FGs and high-k dielectrics into a dual floating gate memory structure, yielding both volatile and non-volatile modes of operation, has been demonstrated for plausible application in future unified memory architectures. The electrical characterization was performed on simple MIS/MIM and memory capacitors, fabricated through CMOS compatible processes. Various analytical characterization techniques were employed to gain more insight into the material behavior of the layers in the device structure. In the first part of this study, interfacial engineering was investigated by exploring La2O3 as a SiO2 scavenging layer. Through silicate formation, the consumption of low-k SiO2 was controlled, resulting in a significant improvement in dielectric leakage. The performance improvement was also gauged through memory capacitors. In the second part of the study, a novel memory structure consisting of a continuous metal FG in the form of PVD TaN was investigated along with a high-k blocking dielectric. The material properties of TaN metal and high-k / low-k dielectric engineering were systematically studied. The resulting memory structures exhibit excellent memory characteristics and scalability of the metal FG down to ~1 nm, which is promising for reducing unwanted FG-FG interference. In the later part of the study, the thermal stability of the combined stack was examined, and various approaches to improve the stability and understand the cause of instability were explored. The performance of the high-k IPD metal FG memory structure was observed to degrade under more aggressive annealing conditions, and the deteriorated behavior was attributed to the leakage instability of the high-k / TaN capacitor. While the degradation is pronounced in both MIM and MIS capacitors, a higher leakage increase was seen in MIM, which was attributed to the higher degree of dielectric crystallization. In an attempt to improve the thermal stability, the trade-off in using amorphous interlayers to reduce the enhanced dielectric crystallization on metal was highlighted. Also, the effect of oxygen vacancies and grain growth on the dielectric leakage was studied through a multi-deposition-multi-anneal technique. Multi-step deposition and annealing in a more electronegative ambient was observed to have a positive impact on the dielectric performance.

  12. Associative-memory representations emerge as shared spatial patterns of theta activity spanning the primate temporal cortex

    PubMed Central

    Nakahara, Kiyoshi; Adachi, Ken; Kawasaki, Keisuke; Matsuo, Takeshi; Sawahata, Hirohito; Majima, Kei; Takeda, Masaki; Sugiyama, Sayaka; Nakata, Ryota; Iijima, Atsuhiko; Tanigawa, Hisashi; Suzuki, Takafumi; Kamitani, Yukiyasu; Hasegawa, Isao

    2016-01-01

    Highly localized neuronal spikes in primate temporal cortex can encode associative memory; however, whether memory formation involves area-wide reorganization of ensemble activity, which often accompanies rhythmicity, or just local microcircuit-level plasticity, remains elusive. Using high-density electrocorticography, we capture local-field potentials spanning the monkey temporal lobes, and show that the visual pair-association (PA) memory is encoded in spatial patterns of theta activity in areas TE, 36, and, partially, in the parahippocampal cortex, but not in the entorhinal cortex. The theta patterns elicited by learned paired associates are distinct between pairs, but similar within pairs. This pattern similarity, emerging through novel PA learning, allows a machine-learning decoder trained on theta patterns elicited by a particular visual item to correctly predict the identity of those elicited by its paired associate. Our results suggest that the formation and sharing of widespread cortical theta patterns via learning-induced reorganization are involved in the mechanisms of associative memory representation. PMID:27282247

  13. The costs of changing an intended action: movement planning, but not execution, interferes with verbal working memory.

    PubMed

    Spiegel, M A; Koester, D; Weigelt, M; Schack, T

    2012-02-16

    How much cognitive effort does it take to change a movement plan? In previous studies, it has been shown that humans plan and represent actions in advance, but it remains unclear whether or not action planning and verbal working memory share cognitive resources. Using a novel experimental paradigm, we combined in two experiments a grasp-to-place task with a verbal working memory task. Participants planned a placing movement toward one of two target positions and subsequently encoded and maintained visually presented letters. Both experiments revealed that re-planning the intended action reduced letter recall performance; execution time, however, was not influenced by action modifications. The results of Experiment 2 suggest that the action's interference with verbal working memory arose during the planning rather than the execution phase of the movement. Together, our results strongly suggest that movement planning and verbal working memory share common cognitive resources. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.

  14. Status and Prospects of ZnO-Based Resistive Switching Memory Devices

    NASA Astrophysics Data System (ADS)

    Simanjuntak, Firman Mangasa; Panda, Debashis; Wei, Kung-Hwa; Tseng, Tseung-Yuen

    2016-08-01

    In the advancement of semiconductor device technology, ZnO could be a promising alternative to other metal oxides given its versatility and wide range of applications. This review provides a thorough overview of ZnO for resistive switching memory (RRAM) devices. Various efforts that have been made to investigate and modulate the switching characteristics of ZnO-based switching memory devices are discussed. The use of ZnO layers in different structures, the different types of filament formation, and the different types of switching, including complementary switching, are reported. In view of the huge interest in transparent devices, this review gives a concrete overview of the present status and prospects of transparent RRAM devices based on ZnO. ZnO-based RRAM can also be used for flexible memory devices, which is likewise covered here. Another challenge in ZnO-based RRAM is the realization of ultra-thin and low-power devices. Nevertheless, ZnO not only offers decent memory properties but also has a unique potential for use in multifunctional nonvolatile memory devices. The impact of electrode materials, metal doping, stack structures, transparency, and flexibility on the resistive switching properties and switching parameters of ZnO-based resistive switching memory devices is briefly compared. This review also covers the different nanostructure-based emerging resistive switching memory devices for low-power scalable devices. It may give valuable insight into developing ZnO-based RRAM and should also encourage researchers to overcome the challenges.

  15. GOES-R GS Product Generation Infrastructure Operations

    NASA Astrophysics Data System (ADS)

    Blanton, M.; Gundy, J.

    2012-12-01

    The GOES-R Ground System (GS) will produce a much larger set of products with higher data density than previous GOES systems. This requires considerably greater compute and memory resources to achieve the necessary latency and availability for these products. Over time, new algorithms could be added and existing ones removed or updated, but the GOES-R GS cannot go down during this time. To meet these GOES-R GS processing needs, the Harris Corporation will implement a Product Generation (PG) infrastructure that is scalable, extensible, modular, and reliable. The primary part of the PG infrastructure is the Service Based Architecture (SBA), which includes the Distributed Data Fabric (DDF). The SBA is the middleware that encapsulates and manages the science algorithms that generate products. The SBA is divided into three parts: the Executive, which manages and configures the algorithm as a service; the Dispatcher, which provides data to the algorithm; and the Strategy, which determines when the algorithm can execute with the available data. The SBA is a distributed architecture, with services connected to each other over a compute grid, and is highly scalable. This plug-and-play architecture allows algorithms to be added, removed, or updated without affecting any other services or software currently running and producing data. Algorithms require product data from other algorithms, so scalable and reliable messaging is necessary. The SBA uses the DDF to provide this data communication layer between algorithms. The DDF provides an abstract interface over a distributed and persistent multi-layered storage system (memory-based caching above disk-based storage) and an event system that allows algorithm services to know when data is available and to get the data they need, when they need it, to begin processing. Together, the SBA and the DDF provide a flexible, high performance architecture that can meet the needs of product processing now and as they grow in the future.
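
    The division of labor among Executive, Dispatcher, and Strategy can be sketched in a few lines. The class and method names below are ours, chosen to mirror the description above; the real SBA is a distributed middleware, not an in-process toy.

    ```python
    # Shape of the Service Based Architecture described above: a Strategy
    # decides when an algorithm can run, a Dispatcher feeds it data, and
    # an Executive hosts it as a service.

    class Strategy:
        """Fires when every required input product is available."""
        def __init__(self, required):
            self.required, self.have = set(required), {}
        def offer(self, name, data):
            self.have[name] = data
            return self.required <= self.have.keys()

    class Executive:
        def __init__(self, algorithm, strategy):
            self.algorithm, self.strategy = algorithm, strategy
        def on_data(self, name, data):          # called by the Dispatcher
            if self.strategy.offer(name, data):
                return self.algorithm(self.strategy.have)

    cloud_mask = Executive(lambda d: ("CloudMask", d["Band2"] + d["Band14"]),
                           Strategy({"Band2", "Band14"}))
    print(cloud_mask.on_data("Band2", 1))       # None: still waiting
    print(cloud_mask.on_data("Band14", 2))      # ('CloudMask', 3)
    ```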

  16. Algae-Based Biofuel Distribution System to Service the Department of Defense in Hawaii

    DTIC Science & Technology

    2013-03-01

    reliance on global sources of petroleum fuels by increasing use of alternative fuels. News articles were gathered that contained public statements... markets to reduce shared risks among stakeholders, discussion of scalability potential based on existing biofuels industry capabilities in Hawaii, and...biofuels objective given the growing economies of foreign entities within their operating regions and the highly volatile petroleum market . These

  17. Towards Sustainability and Scalability of Educational Innovations in Hydrology: What is the Value and who is the Customer?

    NASA Astrophysics Data System (ADS)

    Deshotel, M.; Habib, E. H.

    2016-12-01

    There is an increasing desire by the water education community to use emerging research resources and technological advances in order to reform current educational practices. Recent years have witnessed some exemplary developments that tap into emerging hydrologic modeling and data sharing resources, innovative digital and visualization technologies, and field experiences. However, such attempts remain largely at the scale of individual efforts and fall short of meeting scalability and sustainability solutions. This can be attributed to a number of reasons, such as inadequate experience with modeling and data-based educational developments, lack of faculty time to invest in further developments, and lack of resources to further support the project. Another important but often-overlooked reason is the lack of adequate insight into the actual needs of the end users of such developments. Such insight is highly critical to inform how to scale and sustain educational innovations. In this presentation, we share with the hydrologic community experiences gathered from an ongoing experiment in which the authors engaged in a hypothesis-driven, customer-discovery process to inform the scalability and sustainability of educational innovations in the field of hydrology and water resources education. The experiment is part of a program called Innovation Corps for Learning (I-Corps L). This program follows a business model approach in which a value proposition is initially formulated for the educational innovation. The authors then engaged in a hypothesis-validation process through an intense series of customer interviews with different segments of potential end users, including junior/senior students, student interns, and hydrology professors. The authors also sought insight from engineering firms by interviewing junior engineers and their supervisors to gather feedback on the preparedness of graduating engineers as they enter the workforce in the area of water resources. Exploring the large landscape of potential users is critical in formulating a user-driven approach that can inform the innovation development. The presentation shares the results of this experiment and the insight gained, and discusses how such information can inform the community on sustaining and scaling hydrology educational developments.

  18. Synapsin Determines Memory Strength after Punishment- and Relief-Learning

    PubMed Central

    Niewalda, Thomas; Michels, Birgit; Jungnickel, Roswitha; Diegelmann, Sören; Kleber, Jörg; Kähne, Thilo

    2015-01-01

    Adverse life events can induce two kinds of memory with opposite valence, dependent on timing: “negative” memories for stimuli preceding them and “positive” memories for stimuli experienced at the moment of “relief.” Such punishment memory and relief memory are found in insects, rats, and man. For example, fruit flies (Drosophila melanogaster) avoid an odor after odor-shock training (“forward conditioning” of the odor), whereas after shock-odor training (“backward conditioning” of the odor) they approach it. Do these timing-dependent associative processes share molecular determinants? We focus on the role of Synapsin, a conserved presynaptic phosphoprotein regulating the balance between the reserve pool and the readily releasable pool of synaptic vesicles. We find that a lack of Synapsin leaves task-relevant sensory and motor faculties unaffected. In contrast, both punishment memory and relief memory scores are reduced. These defects reflect a true lessening of associative memory strength, as distortions in nonassociative processing (e.g., susceptibility to handling, adaptation, habituation, sensitization), discrimination ability, and changes in the time course of coincidence detection can be ruled out as alternative explanations. Reductions in punishment- and relief-memory strength are also observed upon an RNAi-mediated knock-down of Synapsin, and are rescued both by acutely restoring Synapsin and by locally restoring it in the mushroom bodies of mutant flies. Thus, both punishment memory and relief memory require the Synapsin protein and in this sense share genetic and molecular determinants. We note that corresponding molecular commonalities between punishment memory and relief memory in humans would constrain pharmacological attempts to selectively interfere with excessive associative punishment memories, e.g., after traumatic experiences. PMID:25972175

  19. Synapsin determines memory strength after punishment- and relief-learning.

    PubMed

    Niewalda, Thomas; Michels, Birgit; Jungnickel, Roswitha; Diegelmann, Sören; Kleber, Jörg; Kähne, Thilo; Gerber, Bertram

    2015-05-13

    Adverse life events can induce two kinds of memory with opposite valence, dependent on timing: "negative" memories for stimuli preceding them and "positive" memories for stimuli experienced at the moment of "relief." Such punishment memory and relief memory are found in insects, rats, and man. For example, fruit flies (Drosophila melanogaster) avoid an odor after odor-shock training ("forward conditioning" of the odor), whereas after shock-odor training ("backward conditioning" of the odor) they approach it. Do these timing-dependent associative processes share molecular determinants? We focus on the role of Synapsin, a conserved presynaptic phosphoprotein regulating the balance between the reserve pool and the readily releasable pool of synaptic vesicles. We find that a lack of Synapsin leaves task-relevant sensory and motor faculties unaffected. In contrast, both punishment memory and relief memory scores are reduced. These defects reflect a true lessening of associative memory strength, as distortions in nonassociative processing (e.g., susceptibility to handling, adaptation, habituation, sensitization), discrimination ability, and changes in the time course of coincidence detection can be ruled out as alternative explanations. Reductions in punishment- and relief-memory strength are also observed upon an RNAi-mediated knock-down of Synapsin, and are rescued both by acutely restoring Synapsin and by locally restoring it in the mushroom bodies of mutant flies. Thus, both punishment memory and relief memory require the Synapsin protein and in this sense share genetic and molecular determinants. We note that corresponding molecular commonalities between punishment memory and relief memory in humans would constrain pharmacological attempts to selectively interfere with excessive associative punishment memories, e.g., after traumatic experiences. Copyright © 2015 Niewalda et al.

  20. Audience tuning effects in the context of situated and embodied processes.

    PubMed

    Semin, Gün R

    2018-03-05

    This review provides an overview of the research on communication and the 'Saying is Believing' paradigm in the context of different perspectives on communication. The process of 'audience tuning' is shaped by a variety of situated factors in contexts that affect the communicators' confidence in their message. The overwhelming common denominator is that the combination of features that create ambiguity yields the optimal condition for the formation of shared realities. I conclude with an argument that the implied invariance of memory processes in shared reality work needs to be more attentive to the regulatory function of memories driving the expression of shared realities. Copyright © 2018 Elsevier Ltd. All rights reserved.

  1. A 300MHz Embedded Flash Memory with Pipeline Architecture and Offset-Free Sense Amplifiers for Dual-Core Automotive Microcontrollers

    NASA Astrophysics Data System (ADS)

    Kajiyama, Shinya; Fujito, Masamichi; Kasai, Hideo; Mizuno, Makoto; Yamaguchi, Takanori; Shinagawa, Yutaka

    A novel 300MHz embedded flash memory for dual-core microcontrollers with a shared ROM architecture is proposed. One of its features is a three-stage pipeline read operation, which enables a reduced access pitch and therefore reduces the performance penalty due to conflicts of shared ROM accesses. Another feature is a highly sensitive sense amplifier that achieves efficient pipeline operation with two-cycle latency and one-cycle pitch as a result of a shortened sense time of 0.63ns. The combination of the pipeline architecture and the proposed sense amplifiers significantly reduces access-conflict penalties with the shared ROM and enhances the performance of 32-bit RISC dual-core microcontrollers by 30%.

  2. A general model for memory interference in a multiprocessor system with memory hierarchy

    NASA Technical Reports Server (NTRS)

    Taha, Badie A.; Standley, Hilda M.

    1989-01-01

    The problem of memory interference in a multiprocessor system with a hierarchy of shared buses and memories is addressed. The behavior of the processors is represented by a sequence of memory requests with each followed by a determined amount of processing time. A statistical queuing network model for determining the extent of memory interference in multiprocessor systems with clusters of memory hierarchies is presented. The performance of the system is measured by the expected number of busy memory clusters. The results of the analytic model are compared with simulation results, and the correlation between them is found to be very high.

  3. Domain-general involvement of the posterior frontolateral cortex in time-based resource-sharing in working memory: An fMRI study.

    PubMed

    Vergauwe, Evie; Hartstra, Egbert; Barrouillet, Pierre; Brass, Marcel

    2015-07-15

    Working memory is often defined in cognitive psychology as a system devoted to the simultaneous processing and maintenance of information. In line with the time-based resource-sharing model of working memory (TBRS; Barrouillet and Camos, 2015; Barrouillet et al., 2004), there is accumulating evidence that, when memory items have to be maintained while performing a concurrent activity, memory performance depends on the cognitive load of this activity, independently of the domain involved. The present study used fMRI to identify regions in the brain that are sensitive to variations in cognitive load in a domain-general way. More precisely, we aimed at identifying brain areas that activate during maintenance of memory items as a direct function of the cognitive load induced by both verbal and spatial concurrent tasks. Results show that the right IFJ and bilateral SPL/IPS are the only areas showing an increased involvement as cognitive load increases, and do so in a domain-general manner. When correlating the fMRI signal with the approximated cognitive load as defined by the TBRS model, it was shown that the main focus of the cognitive load-related activation is located in the right IFJ. The present findings indicate that the IFJ makes domain-general contributions to time-based resource-sharing in working memory and allowed us to generate the novel hypothesis that the IFJ might be the neural basis for the process of rapid switching. We argue that the IFJ might be a crucial part of a central attentional bottleneck in the brain because of its inability to upload more than one task rule at once. Copyright © 2015 Elsevier Inc. All rights reserved.

  4. Information and processes underlying semantic and episodic memory across tasks, items, and individuals.

    PubMed

    Cox, Gregory E; Hemmer, Pernille; Aue, William R; Criss, Amy H

    2018-04-01

    The development of memory theory has been constrained by a focus on isolated tasks rather than the processes and information that are common to situations in which memory is engaged. We present results from a study in which 453 participants took part in five different memory tasks: single-item recognition, associative recognition, cued recall, free recall, and lexical decision. Using hierarchical Bayesian techniques, we jointly analyzed the correlations between tasks within individuals (reflecting the degree to which tasks rely on shared cognitive processes) and within items (reflecting the degree to which tasks rely on the same information conveyed by the item). Among other things, we find that (a) the processes involved in lexical access and episodic memory are largely separate and rely on different kinds of information, (b) access to lexical memory is driven primarily by perceptual aspects of a word, (c) all episodic memory tasks rely to an extent on a set of shared processes which make use of semantic features to encode both single words and associations between words, and (d) recall involves additional processes likely related to contextual cuing and response production. These results provide a large-scale picture of memory across different tasks which can serve to drive the development of comprehensive theories of memory. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  5. What Drives Memory-Driven Attentional Capture? The Effects of Memory Type, Display Type, and Search Type

    ERIC Educational Resources Information Center

    Olivers, Christian N. L.

    2009-01-01

    An important question is whether visual attention (the ability to select relevant visual information) and visual working memory (the ability to retain relevant visual information) share the same content representations. Some past research has indicated that they do: Singleton distractors interfered more strongly with a visual search task when they…

  6. Discrete Resource Allocation in Visual Working Memory

    ERIC Educational Resources Information Center

    Barton, Brian; Ester, Edward F.; Awh, Edward

    2009-01-01

    Are resources in visual working memory allocated in a continuous or a discrete fashion? On one hand, flexible resource models suggest that capacity is determined by a central resource pool that can be flexibly divided such that items of greater complexity receive a larger share of resources. On the other hand, if capacity in working memory is…

  7. Natural Conversations as a Source of False Memories in Children: Implications for the Testimony of Young Witnesses

    PubMed Central

    Principe, Gabrielle F.; Schindewolf, Erica

    2012-01-01

    Research on factors that can affect the accuracy of children’s autobiographical remembering has important implications for understanding the abilities of young witnesses to provide legal testimony. In this article, we review our own recent research on one factor that has much potential to induce errors in children’s event recall, namely natural memory sharing conversations with peers and parents. Our studies provide compelling evidence not only that the content of conversations about the past can intrude into later memory but also that such exchanges can prompt the generation of entirely false narratives that are more detailed than true accounts of experienced events. Further, our work shows that deeper and more creative participation in memory sharing dialogues can boost the damaging effects of conversationally conveyed misinformation. Implications of this collection of findings for children’s testimony are discussed. PMID:23129880

  8. On nonlinear finite element analysis in single-, multi- and parallel-processors

    NASA Technical Reports Server (NTRS)

    Utku, S.; Melosh, R.; Islam, M.; Salama, M.

    1982-01-01

    Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.

  9. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A.; Mamidala, Amith R.

    2013-09-03

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  10. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOEpatents

    Blocksome, Michael A; Mamidala, Amith R

    2014-02-11

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
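
    The key property in both patents, that deterministic (in-order) delivery lets a FENCE complete without per-transfer accounting, can be illustrated with a simple ordered-queue model. The queue abstraction below is our own sketch, not the PAMI implementation.

    ```python
    # Sketch of fence semantics over an ordered DMA channel: because
    # delivery between two endpoints is deterministic, the FENCE is just
    # another queued instruction and needs no accounting; it completes
    # once everything enqueued before it has drained.

    from collections import deque

    class Channel:
        """Ordered DMA channel between two endpoints."""
        def __init__(self):
            self.queue = deque()
        def put_dma(self, payload):
            self.queue.append(("DMA", payload))
        def fence(self):
            self.queue.append(("FENCE", None))
        def drain(self):
            while self.queue:
                kind, payload = self.queue.popleft()
                if kind == "DMA":
                    print("transfer", payload, "delivered")
                else:
                    print("fence completed: all prior DMAs done")

    ch = Channel()
    ch.put_dma("block-A")
    ch.put_dma("block-B")
    ch.fence()
    ch.drain()
    ```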

  11. Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, Shuangshuang; Chen, Yousu; Wu, Di

    2015-12-09

    Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by a protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computationally intensive and challenging to solve using a single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on a shared-memory platform, and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.
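
    The distributed-memory side of such a comparison can be sketched with mpi4py: generators are partitioned across ranks, each rank integrates its own toy swing dynamics, and the network coupling is exchanged every step (the shared-memory OpenMP variant would instead read a common array directly). The model below is a drastically simplified stand-in for the paper's differential-algebraic system.

    ```python
    # Sketch of a distributed-memory decomposition for dynamic simulation:
    # each rank owns a subset of generators and exchanges the coupling
    # state every step. Requires mpi4py; the dynamics are a toy model.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_gen = 8
    mine = range(rank, n_gen, size)            # this rank's generators
    delta = {g: 0.1 * g for g in mine}         # rotor angles (toy state)
    dt = 0.01

    for _ in range(50):
        # network coupling needs every angle: gather them on all ranks
        all_delta = {}
        for d in comm.allgather(delta):
            all_delta.update(d)
        mean = sum(all_delta.values()) / n_gen
        for g in mine:                          # relax toward the mean angle
            delta[g] += dt * (mean - delta[g])

    if rank == 0:
        print("angle spread:", max(all_delta.values()) - min(all_delta.values()))
    ```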

  12. Parallelization of NAS Benchmarks for Shared Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.

  13. A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.

    PubMed

    Jain, Chirag; Dilthey, Alexander; Koren, Sergey; Aluru, Srinivas; Phillippy, Adam M

    2018-04-30

    Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290× faster than Burrows-Wheeler Aligner-MEM with a lower memory footprint and a recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
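
    Both ingredients, minimizer seeds and MinHash-style identity estimation, fit in a few lines. The sketch below uses Python's built-in hash and a bottom-s sketch; the paper's actual windowing, hashing, and statistical framework are more involved.

    ```python
    # Sketch of the two ingredients of the mapper: (w,k)-minimizers to
    # sample seeds, and a bottom-s MinHash sketch whose overlap estimates
    # the Jaccard similarity (and hence sequence identity).

    def minimizers(seq, k=5, w=4):
        kmers = [hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
        return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}

    def jaccard(a, b, sketch_size=16):
        sa = set(sorted(a)[:sketch_size])       # bottom-s sketch of each set
        sb = set(sorted(b)[:sketch_size])
        merged = sorted(sa | sb)[:sketch_size]  # bottom-s sketch of the union
        return len(set(merged) & sa & sb) / len(merged)

    read = "ACGTACGTTGACGTACGAT"
    ref  = "ACGTACGTTGACGTACGATCCGGA"
    print("estimated Jaccard:", jaccard(minimizers(read), minimizers(ref)))
    ```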

  14. Energy Logic (EL): a novel fusion engine of multi-modality multi-agent data/information fusion for intelligent surveillance systems

    NASA Astrophysics Data System (ADS)

    Rababaah, Haroun; Shirkhodaie, Amir

    2009-04-01

    Rapidly advancing hardware technology, smart sensors, and sensor networks are advancing environment sensing. One major potential application of this technology is Large-Scale Surveillance Systems (LS3) for homeland security, battlefield intelligence, facility guarding, and other civilian applications. The efficient and effective deployment of LS3 requires addressing a number of aspects impacting the scalability of such systems. The scalability factors are related to: computation and memory utilization efficiency; communication bandwidth utilization; network topology (e.g., centralized, ad-hoc, hierarchical, or hybrid); network communication protocol and data routing schemes; and local and global data/information fusion schemes for situational awareness. Although many models have been proposed to address one aspect or another of these issues, few have addressed the need for a multi-modality multi-agent data/information fusion architecture with characteristics satisfying the requirements of current and future intelligent sensors and sensor networks. In this paper, we present a novel scalable fusion engine for multi-modality multi-agent information fusion for LS3. The new fusion engine is based on a concept we call Energy Logic. Experimental results of this work, as compared to a fuzzy logic model, strongly supported the validity of the new model and inspired future directions for different levels of fusion and different applications.

  15. FPGA-based prototype storage system with phase change memory

    NASA Astrophysics Data System (ADS)

    Li, Gezi; Chen, Xiaogang; Chen, Bomy; Li, Shunfen; Zhou, Mi; Han, Wenbing; Song, Zhitang

    2016-10-01

    With the ever-increasing amount of data being stored via social media, mobile telephony base stations, network devices, etc., database systems face severe bandwidth bottlenecks when moving vast amounts of data from storage to the processing nodes. At the same time, Storage Class Memory (SCM) technologies such as Phase Change Memory (PCM), with unique features like fast read access, high density, non-volatility, byte-addressability, positive response to increasing temperature, superior scalability, and zero standby leakage, have changed the landscape of modern computing and storage systems. In such a scenario, we present a storage system called FLEET which can off-load partial or whole SQL queries from the CPU to the storage engine. FLEET uses an FPGA rather than conventional CPUs to implement the off-load engine due to its highly parallel nature. We have implemented an initial prototype of FLEET with PCM-based storage. The results demonstrate that significant performance and CPU utilization gains can be achieved by pushing selected query processing components inside PCM-based storage.

  16. FPGA cluster for high-performance AO real-time control system

    NASA Astrophysics Data System (ADS)

    Geng, Deli; Goodsell, Stephen J.; Basden, Alastair G.; Dipper, Nigel A.; Myers, Richard M.; Saunter, Chris D.

    2006-06-01

    Whilst the high throughput and low latency requirements of the next generation AO real-time control systems have posed a significant challenge to von Neumann architecture processor systems, the Field Programmable Gate Array (FPGA) has emerged as a long term solution with high performance on throughput and excellent predictability on latency. Moreover, FPGA devices have highly capable programmable interfacing, which leads to more highly integrated systems. Nevertheless, a single FPGA is still not enough: multiple FPGA devices need to be clustered to perform the required subaperture processing and the reconstruction computation. In an AO real-time control system, the memory bandwidth is often the bottleneck of the system, simply because a vast amount of supporting data, e.g. pixel calibration maps and the reconstruction matrix, needs to be accessed within a short period. The cluster, as a general computing architecture, has excellent scalability in processing throughput, memory bandwidth, memory capacity, and communication bandwidth. Problems such as task distribution, node communication, and system verification are discussed.

  17. Parallel computing of a digital hologram and particle searching for microdigital-holographic particle-tracking velocimetry

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Satake, Shin-ichi; Kanamori, Hiroyuki; Kunugi, Tomoaki

    2007-02-01

    We have developed a parallel algorithm for microdigital-holographic particle-tracking velocimetry. The algorithm is used in (1) numerical reconstruction of particle images from a digital hologram, and (2) searching for particles. The numerical reconstruction from the digital hologram makes use of the Fresnel diffraction equation and the FFT (fast Fourier transform), whereas the particle search algorithm looks for local maxima of gradation in a reconstruction field represented by a 3D matrix. To achieve high performance computing for both calculations (reconstruction and particle search), two memory partitions are allocated to the 3D matrix. In this matrix, the reconstruction part consists of horizontally placed 2D memory partitions on the x-y plane for the FFT, whereas the particle search part consists of vertically placed 2D memory partitions set along the z axis. Consequently, scalability is obtained in proportion to the number of processor elements, where the benchmarks are carried out for parallel computation on an SGI Altix machine.
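
    The per-plane reconstruction step can be sketched with numpy: propagate the hologram to a depth z using the Fresnel transfer function evaluated via FFTs; the particle search then scans the resulting stack of planes for local maxima. Wavelength, pixel pitch, and the random stand-in hologram below are arbitrary.

    ```python
    # Minimal numerical reconstruction step used in digital holography:
    # propagate the hologram to depth z with the Fresnel transfer function
    # (constant phase factor exp(ikz) dropped), evaluated via FFTs.

    import numpy as np

    def fresnel_propagate(holo, wavelength, dx, z):
        n = holo.shape[0]
        fx = np.fft.fftfreq(n, d=dx)
        FX, FY = np.meshgrid(fx, fx)
        # Fresnel transfer function H(fx, fy; z)
        H = np.exp(-1j * np.pi * wavelength * z * (FX**2 + FY**2))
        return np.fft.ifft2(np.fft.fft2(holo) * H)

    holo = np.random.rand(256, 256)            # stand-in hologram
    plane = fresnel_propagate(holo, 0.532e-6, 5e-6, z=0.01)
    print("reconstructed intensity peak:", np.abs(plane).max())
    ```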

  18. Two-Hierarchy Entanglement Swapping for a Linear Optical Quantum Repeater

    NASA Astrophysics Data System (ADS)

    Xu, Ping; Yong, Hai-Lin; Chen, Luo-Kan; Liu, Chang; Xiang, Tong; Yao, Xing-Can; Lu, He; Li, Zheng-Da; Liu, Nai-Le; Li, Li; Yang, Tao; Peng, Cheng-Zhi; Zhao, Bo; Chen, Yu-Ao; Pan, Jian-Wei

    2017-10-01

    Quantum repeaters play a significant role in achieving long-distance quantum communication. In the past decades, tremendous effort has been devoted towards constructing a quantum repeater. As one of the crucial elements, entanglement has been created in different memory systems via entanglement swapping. The realization of j -hierarchy entanglement swapping, i.e., connecting quantum memory and further extending the communication distance, is important for implementing a practical quantum repeater. Here, we report the first demonstration of a fault-tolerant two-hierarchy entanglement swapping with linear optics using parametric down-conversion sources. In the experiment, the dominant or most probable noise terms in the one-hierarchy entanglement swapping, which is on the same order of magnitude as the desired state and prevents further entanglement connections, are automatically washed out by a proper design of the detection setting, and the communication distance can be extended. Given suitable quantum memory, our techniques can be directly applied to implementing an atomic ensemble based quantum repeater, and are of significant importance in the scalable quantum information processing.

  19. Two-Hierarchy Entanglement Swapping for a Linear Optical Quantum Repeater.

    PubMed

    Xu, Ping; Yong, Hai-Lin; Chen, Luo-Kan; Liu, Chang; Xiang, Tong; Yao, Xing-Can; Lu, He; Li, Zheng-Da; Liu, Nai-Le; Li, Li; Yang, Tao; Peng, Cheng-Zhi; Zhao, Bo; Chen, Yu-Ao; Pan, Jian-Wei

    2017-10-27

    Quantum repeaters play a significant role in achieving long-distance quantum communication. In the past decades, tremendous effort has been devoted towards constructing a quantum repeater. As one of the crucial elements, entanglement has been created in different memory systems via entanglement swapping. The realization of j-hierarchy entanglement swapping, i.e., connecting quantum memory and further extending the communication distance, is important for implementing a practical quantum repeater. Here, we report the first demonstration of a fault-tolerant two-hierarchy entanglement swapping with linear optics using parametric down-conversion sources. In the experiment, the dominant or most probable noise terms in the one-hierarchy entanglement swapping, which is on the same order of magnitude as the desired state and prevents further entanglement connections, are automatically washed out by a proper design of the detection setting, and the communication distance can be extended. Given suitable quantum memory, our techniques can be directly applied to implementing an atomic ensemble based quantum repeater, and are of significant importance in the scalable quantum information processing.

  20. Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.

    PubMed

    Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano

    2014-09-09

    A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system.

  1. ABINIT: Plane-Wave-Based Density-Functional Theory on High Performance Computers

    NASA Astrophysics Data System (ADS)

    Torrent, Marc

    2014-03-01

    For several years, a continuous effort has been produced to adapt electronic structure codes based on Density-Functional Theory to future computing architectures. Among these codes, ABINIT is based on a plane-wave description of the wave functions which allows one to treat systems of any kind. Porting such a code to petascale architectures poses difficulties related to the many-body nature of the DFT equations. To improve the performance of ABINIT - especially for what concerns standard LDA/GGA ground-state and response-function calculations - several strategies have been followed: A full multi-level MPI parallelisation scheme has been implemented, exploiting all possible levels and distributing both computation and memory. It allows an increase in the number of distributed processes and could not be achieved without a strong restructuring of the code. The core algorithm used to solve the eigenproblem ("Locally Optimal Blocked Conjugate Gradient"), a blocked Davidson-like algorithm, is based on a distribution of processes combining plane-waves and bands. In addition to the distributed-memory parallelization, a full hybrid scheme has been implemented, using standard shared-memory directives (OpenMP/OpenACC) or porting some time-consuming code sections to Graphics Processing Units (GPU). As no simple performance model exists, the complexity of use has been increased; the code efficiency strongly depends on the distribution of processes among the numerous levels. ABINIT is able to predict the performance of several process distributions and automatically choose the most favourable one. On the other hand, a big effort has been carried out to analyse the performance of the code on petascale architectures, showing which sections of code have to be improved; they are all related to matrix algebra (diagonalisation, orthogonalisation). The different strategies employed to improve the code scalability will be described. They are based on an exploration of new diagonalisation algorithms, as well as the use of external optimised libraries. Part of this work has been supported by the European PRACE project (Partnership for Advanced Computing in Europe) in the framework of its work package 8.
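
    The eigensolver named above is available outside ABINIT as well; SciPy ships an implementation, so the role it plays (extracting the lowest block of eigenpairs of a large sparse operator) can be demonstrated in a few lines. The 1-D Laplacian below is a stand-in for the Hamiltonian applied per k-point/band block.

    ```python
    # LOBPCG as shipped in SciPy: solve for the lowest few eigenpairs of a
    # sparse operator, starting from a random block of vectors. The 1-D
    # Laplacian stands in for a plane-wave Hamiltonian.

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import lobpcg

    n, nev = 500, 4
    A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)).tocsr()  # 1-D Laplacian
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, nev))          # block of starting vectors

    vals, vecs = lobpcg(A, X, largest=False, tol=1e-8, maxiter=500)
    print("lowest eigenvalues:", np.round(vals, 6))
    ```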

  2. Large-Scale Image Analytics Using Deep Learning

    NASA Astrophysics Data System (ADS)

    Ganguly, S.; Nemani, R. R.; Basu, S.; Mukhopadhyay, S.; Michaelis, A.; Votava, P.

    2014-12-01

    High resolution land cover classification maps are needed to increase the accuracy of current land ecosystem and climate model outputs. Few studies demonstrate the state of the art in deriving very high resolution (VHR) land cover products. In addition, most methods rely heavily on commercial software that is difficult to scale given the region of study (e.g. continents to globe). Complexities in present approaches relate to (a) scalability of the algorithm, (b) large image data processing (compute and memory intensive), (c) computational cost, (d) massively parallel architecture, and (e) machine learning automation. In addition, VHR satellite datasets are of the order of terabytes and features extracted from these datasets are of the order of petabytes. In our present study, we have acquired the National Agricultural Imaging Program (NAIP) dataset for the Continental United States at a spatial resolution of 1 m. This data comes as image tiles (a total of a quarter million image scenes with ~60 million pixels each) and has a total size of ~100 terabytes for a single acquisition. Features extracted from the entire dataset would amount to ~8-10 petabytes. In our proposed approach, we have implemented a novel semi-automated machine learning algorithm rooted in the principles of "deep learning" to delineate the percentage of tree cover. In order to perform image analytics in such a granular system, it is mandatory to devise an intelligent archiving and query system for image retrieval, file structuring, metadata processing and filtering of all available image scenes. Using the Open NASA Earth Exchange (NEX) initiative, which is a partnership with Amazon Web Services (AWS), we have developed an end-to-end architecture for designing the database and the deep belief network (following the DistBelief computing model) to solve the grand challenge of scaling this process across the quarter million NAIP tiles that cover the entire Continental United States. The AWS core components that we use to solve this problem are DynamoDB along with S3 for database query and storage, the ElastiCache shared memory architecture for image segmentation, Elastic MapReduce (EMR) for image feature extraction, and memory-optimized Elastic Cloud Compute (EC2) instances for the learning algorithm.

  3. A high performance parallel algorithm for 1-D FFT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Agarwal, R.C.; Gustavson, F.G.; Zubair, M.

    1994-12-31

    In this paper the authors propose a parallel high performance FFT algorithm based on a multi-dimensional formulation. They use this to solve a commonly encountered FFT based kernel on a distributed memory parallel machine, the IBM scalable parallel system, SP1. The kernel requires a forward FFT computation of an input sequence, multiplication of the transformed data by a coefficient array, and finally an inverse FFT computation of the resultant data. They show that the multi-dimensional formulation helps in reducing the communication costs and also improves the single node performance by effectively utilizing the memory system of the node. They implemented this kernel on the IBM SP1 and observed a performance of 1.25 GFLOPS on a 64-node machine.
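
    In serial form the kernel is three steps: forward FFT, pointwise multiplication by a coefficient array, inverse FFT. A numpy sketch follows; the parallel version in the paper distributes the equivalent multi-dimensional reformulation across nodes, which this sketch does not attempt.

    ```python
    import numpy as np

    n = 1 << 16
    x = np.random.rand(n) + 1j * np.random.rand(n)   # input sequence
    coeff = np.random.rand(n)                        # coefficient array

    # Forward FFT, pointwise scaling, inverse FFT.
    y = np.fft.ifft(coeff * np.fft.fft(x))
    ```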

  4. HTM Spatial Pooler With Memristor Crossbar Circuits for Sparse Biometric Recognition.

    PubMed

    James, Alex Pappachen; Fedorova, Irina; Ibrayev, Timur; Kudithipudi, Dhireesha

    2017-06-01

    Hierarchical Temporal Memory (HTM) is an online machine learning algorithm that emulates the neo-cortex. The development of a scalable on-chip HTM architecture is an open research area. The two core substructures of HTM are the spatial pooler and temporal memory. In this work, we propose a new Spatial Pooler circuit design with parallel memristive crossbar arrays for the 2D columns. The proposed design was validated on two different benchmark tasks: face recognition and speech recognition. The circuits are simulated and analyzed using a practical memristor device model and a 0.18 μm IBM CMOS technology model. The AR, YALE, ORL, and UFI databases are used to test the performance of the design in face recognition, and the TIMIT dataset is used for speech recognition.
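
    As a software reference point for what the crossbar computes, a Spatial Pooler step reduces to measuring each column's overlap with a binary input and activating the top-k columns. The sketch below is a simplified behavioural model with illustrative sizes; it is not the circuit design.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_columns = 1024, 2048
    synapses = rng.random((n_columns, n_inputs)) < 0.05  # connected synapses

    def spatial_pool(x, k=40):
        overlap = (synapses & x).sum(axis=1)  # matches per column
        winners = np.argsort(overlap)[-k:]    # k best-matching columns
        sdr = np.zeros(n_columns, dtype=bool)
        sdr[winners] = True
        return sdr                            # sparse distributed representation

    x = rng.random(n_inputs) < 0.1
    print(spatial_pool(x).sum())              # 40 active columns
    ```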

  5. Protection of Mission-Critical Applications from Untrusted Execution Environment: Resource Efficient Replication and Migration of Virtual Machines

    DTIC Science & Technology

    2015-09-28

    the performance of log-and-replay can degrade significantly for VMs configured with multiple virtual CPUs, since the shared memory communication...whether based on checkpoint replication or log-and-replay, existing HA approaches use in-memory backups. The backup VM sits in the memory of a...efficiently. Subject terms: high-availability virtual machines, live migration, memory and traffic overheads, application suspension, Java

  6. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1995-01-01

    This project investigated the requirements for supporting distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations generally do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.

  7. A Hybrid EAV-Relational Model for Consistent and Scalable Capture of Clinical Research Data.

    PubMed

    Khan, Omar; Lim Choi Keung, Sarah N; Zhao, Lei; Arvanitis, Theodoros N

    2014-01-01

    Many clinical research databases are built for specific purposes and their design is often guided by the requirements of their particular setting. Not only does this lead to issues of interoperability and reusability between research groups in the wider community but, within the project itself, changes and additions to the system could be implemented using an ad hoc approach, which may make the system difficult to maintain and even more difficult to share. In this paper, we outline a hybrid Entity-Attribute-Value and relational model approach for modelling data, in light of frequently changing requirements, which enables the back-end database schema to remain static, improving the extensibility and scalability of an application. The model also facilitates data reuse. The methods used build on the modular architecture previously introduced in the CURe project.
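
    The core idea can be shown in a few lines of SQL: stable fields live in a conventional relational table, while study-specific variables go into a generic entity-attribute-value table, so new attributes need no schema change. The table and column names below are illustrative, not taken from the CURe project.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE patient (id INTEGER PRIMARY KEY, dob TEXT, sex TEXT);
    CREATE TABLE eav (                      -- attributes added at runtime
        entity_id INTEGER REFERENCES patient(id),
        attribute TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity_id, attribute)
    );
    """)
    con.execute("INSERT INTO patient VALUES (1, '1980-05-01', 'F')")
    con.execute("INSERT INTO eav VALUES (1, 'baseline_bp_systolic', '128')")
    con.execute("INSERT INTO eav VALUES (1, 'smoking_status', 'never')")
    for row in con.execute("SELECT attribute, value FROM eav WHERE entity_id=1"):
        print(row)
    ```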

  8. Cloud computing applications for biomedical science: A perspective.

    PubMed

    Navale, Vivek; Bourne, Philip E

    2018-06-01

    Biomedical research has become a digital data-intensive endeavor, relying on secure and scalable computing, storage, and network infrastructure, which has traditionally been purchased, supported, and maintained locally. For certain types of biomedical applications, cloud computing has emerged as an alternative to locally maintained traditional computing approaches. Cloud computing offers users pay-as-you-go access to services such as hardware infrastructure, platforms, and software for solving common biomedical computational problems. Cloud computing services offer secure on-demand storage and analysis and are differentiated from traditional high-performance computing by their rapid availability and scalability of services. As such, cloud services are engineered to address big data problems and enhance the likelihood of data and analytics sharing, reproducibility, and reuse. Here, we provide an introductory perspective on cloud computing to help the reader determine its value to their own research.

  9. Cloud computing applications for biomedical science: A perspective

    PubMed Central

    2018-01-01

    Biomedical research has become a digital data–intensive endeavor, relying on secure and scalable computing, storage, and network infrastructure, which has traditionally been purchased, supported, and maintained locally. For certain types of biomedical applications, cloud computing has emerged as an alternative to locally maintained traditional computing approaches. Cloud computing offers users pay-as-you-go access to services such as hardware infrastructure, platforms, and software for solving common biomedical computational problems. Cloud computing services offer secure on-demand storage and analysis and are differentiated from traditional high-performance computing by their rapid availability and scalability of services. As such, cloud services are engineered to address big data problems and enhance the likelihood of data and analytics sharing, reproducibility, and reuse. Here, we provide an introductory perspective on cloud computing to help the reader determine its value to their own research. PMID:29902176

  10. Perspectives in astrophysical databases

    NASA Astrophysics Data System (ADS)

    Frailis, Marco; de Angelis, Alessandro; Roberto, Vito

    2004-07-01

    Astrophysics has become a domain extremely rich in scientific data. Data mining tools are needed for information extraction from such large data sets. This calls for an approach to data management emphasizing the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods and simplicity is achieved by properly handling metadata. Moreover, clustering and classification techniques on large data sets pose additional requirements in terms of computation and memory scalability and interpretability of results. In this study we review some possible solutions.

  11. Implementing Shared Memory Parallelism in MCBEND

    NASA Astrophysics Data System (ADS)

    Bird, Adam; Long, David; Dobson, Geoff

    2017-09-01

    MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheeler's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To utilise parallel hardware more effectively, OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND, and assesses the performance of the parallel method implemented in MCBEND.

  12. Memristive effects in oxygenated amorphous carbon nanodevices

    NASA Astrophysics Data System (ADS)

    Bachmann, T. A.; Koelmans, W. W.; Jonnalagadda, V. P.; Le Gallo, M.; Santini, C. A.; Sebastian, A.; Eleftheriou, E.; Craciun, M. F.; Wright, C. D.

    2018-01-01

    Computing with resistive-switching (memristive) memory devices has shown much recent progress and offers an attractive route to circumvent the von Neumann bottleneck, i.e. the separation of processing and memory, which limits the performance of conventional computer architectures. Due to their good scalability and nanosecond switching speeds, carbon-based resistive-switching memory devices could play an important role in this respect. However, devices based on elemental carbon, such as tetrahedral amorphous carbon or ta-C, typically suffer from a low cycling endurance. A material that has proven capable of combining the advantages of elemental carbon-based memories with simple fabrication methods and good endurance performance for binary memory applications is oxygenated amorphous carbon, or a-COx. Here, we examine the memristive capabilities of nanoscale a-COx devices, in particular their ability to provide the multilevel and accumulation properties that underpin computing-type applications. We show the successful operation of nanoscale a-COx memory cells for both the storage of multilevel states (here 3-level) and for the provision of an arithmetic accumulator. We implement a base-16, or hexadecimal, accumulator and show how such a device can carry out hexadecimal arithmetic and simultaneously store the computed result in the self-same a-COx cell, all using fast (sub-10 ns) and low-energy (sub-pJ) input pulses.
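
    The accumulator behaviour can be captured by a simple abstract model: a cell steps through 16 programmable levels and emits a carry when it wraps, so arithmetic and storage coincide in one device. The class below is our own behavioural sketch only, not a device model.

    ```python
    class HexAccumulatorCell:
        """Behavioural model of a base-16 accumulating memory cell."""
        def __init__(self):
            self.state = 0              # one of 16 resistance levels

        def accumulate(self, pulses):
            total = self.state + pulses
            self.state = total % 16     # result stays in the cell
            return total // 16          # carry out to the next digit

    cell = HexAccumulatorCell()
    carry = cell.accumulate(0xB)        # 11 input pulses
    carry += cell.accumulate(0x9)       # 9 more: 20 -> state 4, carry 1
    print(hex(cell.state), carry)       # 0x4 1
    ```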

  13. Towards reversible basic linear algebra subprograms: A performance study

    DOE PAGES

    Perumalla, Kalyan S.; Yoginath, Srikanth B.

    2014-12-06

    Problems such as fault tolerance and scalable synchronization can be efficiently solved using reversibility of applications. Making applications reversible by relying on computation rather than on memory is ideal for large scale parallel computing, especially for the next generation of supercomputers in which memory is expensive in terms of latency, energy, and price. In this direction, a case study is presented here in reversing a computational core, namely, Basic Linear Algebra Subprograms, which is widely used in scientific applications. A new Reversible BLAS (RBLAS) library interface has been designed, and a prototype has been implemented with two modes: (1) a memory mode in which reversibility is obtained by checkpointing to memory in the forward direction and restoring from memory in reverse, and (2) a computational mode in which nothing is saved in the forward direction and restoration is done entirely via inverse computation in reverse. The article focuses on detailed performance benchmarking to evaluate the runtime dynamics and performance effects, comparing reversible computation with checkpointing on both traditional CPU platforms and recent GPU accelerator platforms. For BLAS Level-1 subprograms, the data indicate over an order of magnitude better speed for reversible computation compared to checkpointing. For BLAS Level-2 and Level-3, a more complex tradeoff is observed between reversible computation and checkpointing, depending on the computational and memory complexities of the subprograms.
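
    For a Level-1 kernel the two modes are easy to contrast. Below is a hedged sketch for AXPY (y := a*x + y): memory mode saves y and restores it, while computational mode undoes the update algebraically and stores nothing. The function names are ours, not the RBLAS interface.

    ```python
    import numpy as np

    def axpy_forward(a, x, y):
        y += a * x

    def axpy_reverse_computational(a, x, y):
        y -= a * x                      # exact algebraic inverse

    a, x = 2.0, np.arange(4.0)
    y = np.ones(4)

    checkpoint = y.copy()               # memory mode: save, run, restore
    axpy_forward(a, x, y)
    y[:] = checkpoint

    axpy_forward(a, x, y)               # computational mode: run, invert
    axpy_reverse_computational(a, x, y)
    print(np.allclose(y, np.ones(4)))   # True, up to floating-point rounding
    ```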

  14. Hierarchically Self-Assembled Block Copolymer Blends for Templating Hollow Phase-Change Nanostructures with an Extremely Low Switching Current

    DOE PAGES

    Park, Woon Ik; Kim, Jong Min; Jeong, Jae Won; ...

    2015-03-17

    Phase change memory (PCM) is one of the most promising candidates for next-generation nonvolatile memory devices because of its high speed, excellent reliability, and outstanding scalability. However, the high switching current of PCM devices has been a critical hurdle to realizing low-power operation. Although one solution is to reduce the switching volume of the memory, the resolution limit of photolithography hinders further miniaturization of device dimensions. Here, we employed unconventional self-assembly geometries obtained from blends of block copolymers (BCPs) to form ring-shaped hollow PCM nanostructures with an ultrasmall contact area between a phase-change material (Ge2Sb2Te5) and a heater (TiN) electrode. The high-density (approximately 0.1 terabits per square inch) PCM nanoring arrays showed an extremely small switching current of 2-3 μA. Furthermore, the relatively small reset current of the ring-shaped PCM compared to pillar-shaped devices is attributed to the smaller switching volume, which is well supported by electro-thermal simulation results. Our approach may also be extended to other nonvolatile memory device applications such as resistive switching memory and magnetic storage devices, where the control of nanoscale geometry can significantly affect device performance.

  15. Interfacial Redox Reactions Associated Ionic Transport in Oxide-Based Memories.

    PubMed

    Younis, Adnan; Chu, Dewei; Shah, Abdul Hadi; Du, Haiwei; Li, Sean

    2017-01-18

    As an alternative to transistor-based flash memories, redox-reaction-mediated resistive switches are considered the most promising next-generation nonvolatile memories, combining the advantages of a simple metal/solid electrolyte (insulator)/metal structure, high scalability, low power consumption, and fast processing. For cation-based memories, the unavailability of in-built mobile cations in many solid electrolytes/insulators (e.g., Ta2O5, SiO2, etc.) makes the absorbed water in the films essential for keeping electroneutrality for redox reactions at the counter electrodes. Herein, we demonstrate the electrochemical characteristics (oxidation/reduction reactions) of active electrodes (Ag and Cu) at the electrode/electrolyte interface and the subsequent transport of their ions in an Fe3O4 film by means of cyclic voltammetry measurements. By applying positive potentials to the Ag/Cu active electrodes, Ag preferentially oxidizes to Ag+, while Cu prefers to oxidize into Cu2+ first, followed by Cu/Cu+ oxidation. By sweeping the reverse potential, the oxidized ions can subsequently be reduced at the counter electrode. The results presented here provide a detailed understanding of the resistive switching phenomenon in Fe3O4-based memory cells. The results are further discussed on the basis of electrochemically assisted cation diffusion in the presence of absorbed surface water molecules in the film.

  16. Space-Filling Supercapacitor Carpets: Highly scalable fractal architecture for energy storage

    NASA Astrophysics Data System (ADS)

    Tiliakos, Athanasios; Trefilov, Alexandra M. I.; Tanasǎ, Eugenia; Balan, Adriana; Stamatin, Ioan

    2018-04-01

    Revamping ground-breaking ideas from fractal geometry, we propose an alternative micro-supercapacitor configuration realized by laser-induced graphene (LIG) foams produced via laser pyrolysis of inexpensive commercial polymers. The Space-Filling Supercapacitor Carpet (SFSC) architecture introduces the concept of nested electrodes based on the pre-fractal Peano space-filling curve, arranged in a symmetrical equilateral setup that incorporates multiple parallel capacitor cells sharing common electrodes for maximum efficiency and optimal length-to-area distribution. We elucidate the theoretical foundations of the SFSC architecture, and we introduce innovations (high-resolution vector-mode printing) in the LIG method that allow for the realization of flexible and scalable devices based on low iterations of the Peano algorithm. SFSCs exhibit distributed capacitance properties, leading to capacitance, energy, and power ratings proportional to the number of nested electrodes (up to 4.3 mF, 0.4 μWh, and 0.2 mW for the largest tested model of low iteration using aqueous electrolytes), with competitively high energy and power densities. This can pave the way for full scalability in energy storage, reaching beyond the scale of micro-supercapacitors to incorporation into larger and more demanding applications.

  17. Scalable tuning of building models to hourly data

    DOE PAGES

    Garrett, Aaron; New, Joshua Ryan

    2015-03-31

    Energy models of existing buildings are unreliable unless calibrated so they correlate well with actual energy usage. Manual tuning requires a skilled professional, is prohibitively expensive for small projects, imperfect, non-repeatable, non-transferable, and not scalable to the dozens of sensor channels that smart meters, smart appliances, and cheap/ubiquitous sensors are beginning to make available today. A scalable, automated methodology is needed to quickly and intelligently calibrate building energy models to all available data, increase the usefulness of those models, and facilitate speed-and-scale penetration of simulation-based capabilities into the marketplace for actualized energy savings. The "Autotune" project is a novel, model-agnostic methodology which leverages supercomputing, large simulation ensembles, and big data mining with multiple machine learning algorithms to allow automatic calibration of simulations that match measured experimental data in a way that is deployable on commodity hardware. This paper shares several methodologies employed to reduce the combinatorial complexity to a computationally tractable search problem for hundreds of input parameters. Furthermore, accuracy metrics are provided which quantify model error against measured data for either monthly or hourly electrical usage from a highly-instrumented, emulated-occupancy research home.
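
    Stripped of the supercomputing machinery, autotuning is black-box calibration: sample candidate parameter sets, simulate, and keep whichever minimizes error against measured data. The sketch below uses an invented two-parameter model and random search purely for illustration; Autotune's actual search and simulation engine are far richer.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    measured = 3.0 + rng.normal(0, 0.1, 24)          # fake hourly kWh data

    def simulate(infiltration, setpoint):
        """Stand-in for an EnergyPlus-style building simulation."""
        return np.full(24, infiltration * 1.5 + (22.0 - setpoint) * 0.2)

    def rmse(a, b):
        return float(np.sqrt(np.mean((a - b) ** 2)))

    best, best_err = None, np.inf
    for _ in range(1000):                            # random search
        params = (rng.uniform(0.5, 3.0), rng.uniform(18.0, 26.0))
        err = rmse(simulate(*params), measured)
        if err < best_err:
            best, best_err = params, err
    print(best, round(best_err, 3))
    ```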

  18. SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics

    NASA Astrophysics Data System (ADS)

    Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.

    2015-12-01

    Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF), making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning-fast Big Data technology called SciSpark based on Apache Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based Apache Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd-generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning-based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesoscale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URLs (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dimensional array libraries (Scala Breeze, Java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDFs, and first metrics quantifying parallel speedups and memory & disk usage.
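
    The ingest-then-reduce pattern described above is easy to sketch with PySpark: partition a list of variable URLs, load each array on a worker, and reduce to a global statistic. Here load_array is a placeholder for an OPeNDAP/netCDF reader and the URL names are hypothetical; none of this is SciSpark's actual sRDD API.

    ```python
    import numpy as np
    from pyspark import SparkContext

    def load_array(url):
        return np.random.rand(180, 360)       # stand-in for a lat/lon grid

    sc = SparkContext(appName="scispark-sketch")
    urls = ["model.nc?tas_%03d" % t for t in range(120)]  # hypothetical names

    total, count = (sc.parallelize(urls, 12)  # 12 partitions across workers
                      .map(load_array)        # arrays pulled into memory
                      .map(lambda a: (a.sum(), a.size))
                      .reduce(lambda p, q: (p[0] + q[0], p[1] + q[1])))
    print("global mean:", total / count)
    sc.stop()
    ```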

  19. Effects of Aging on True and False Memory Formation: An fMRI Study

    ERIC Educational Resources Information Center

    Dennis, Nancy A.; Kim, Hongkeun; Cabeza, Roberto

    2007-01-01

    Compared to young, older adults are more likely to forget events that occurred in the past as well as remember events that never happened. Previous studies examining false memories and aging have shown that these memories are more likely to occur when new items share perceptual or semantic similarities with those presented during encoding. It is…

  20. Ad Hoc Categories and False Memories: Memory Illusions for Categories Created On-The-Spot

    ERIC Educational Resources Information Center

    Soro, Jerônimo C.; Ferreira, Mário B.; Semin, Gün R.; Mata, André; Carneiro, Paula

    2017-01-01

    Three experiments were designed to test whether experimentally created ad hoc associative networks evoke false memories. We used the DRM (Deese, Roediger, McDermott) paradigm with lists of ad hoc categories composed of exemplars aggregated toward specific goals (e.g., going for a picnic) that do not share any consistent set of features. Experiment…

  1. Transactive memory in organizational groups: the effects of content, consensus, specialization, and accuracy on group performance.

    PubMed

    Austin, John R

    2003-10-01

    Previous research on transactive memory has found a positive relationship between transactive memory system development and group performance in single project laboratory and ad hoc groups. Closely related research on shared mental models and expertise recognition supports these findings. In this study, the author examined the relationship between transactive memory systems and performance in mature, continuing groups. A group's transactive memory system, measured as a combination of knowledge stock, knowledge specialization, transactive memory consensus, and transactive memory accuracy, is positively related to group goal performance, external group evaluations, and internal group evaluations. The positive relationship with group performance was found to hold for both task and external relationship transactive memory systems.

  2. Virtual memory support for distributed computing environments using a shared data object model

    NASA Astrophysics Data System (ADS)

    Huang, F.; Bacon, J.; Mapp, G.

    1995-12-01

    Conventional storage management systems provide one interface for accessing memory segments and another for accessing secondary storage objects. This hinders application programming and affects overall system performance due to mandatory data copying and user/kernel boundary crossings, which in the microkernel case may involve context switches. Memory-mapping techniques may be used to provide programmers with a unified view of the storage system. This paper extends such techniques to support a shared data object model for distributed computing environments in which good support for coherence and synchronization is essential. The approach is based on a microkernel, typed memory objects, and integrated coherence control. A microkernel architecture is used to support multiple coherence protocols and the addition of new protocols. Memory objects are typed and applications can choose the most suitable protocols for different types of object to avoid protocol mismatch. Low-level coherence control is integrated with high-level concurrency control so that the number of messages required to maintain memory coherence is reduced and system-wide synchronization is realized without severely impacting the system performance. These features together contribute a novel approach to the support for flexible coherence under application control.
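
    The unified-view idea is what mmap already gives within one host: a storage object is mapped into the address space and then accessed with ordinary memory operations rather than read/write calls. A minimal single-host sketch follows; the paper's contribution, distributed coherence and synchronization over such mappings, is not shown here.

    ```python
    import mmap, os

    path = "/tmp/shared_object.bin"             # illustrative path
    with open(path, "wb") as f:
        f.write(b"\x00" * 4096)                 # back the mapping with a file

    with open(path, "r+b") as f:
        mem = mmap.mmap(f.fileno(), 4096)       # map the object into memory
        mem[0:5] = b"hello"                     # plain memory store, no write()
        mem.flush()                             # propagate to the file
        mem.close()

    with open(path, "rb") as f:
        print(f.read(5))                        # b'hello'
    os.remove(path)
    ```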

  3. Extensible Interest Management for Scalable Persistent Distributed Virtual Environments

    DTIC Science & Technology

    1999-12-01

    ...(Calvin, Cebula et al. 1995; Morse, Bic et al. 2000) uses a two-grid scheme, with each grid cell having two multicast addresses. An entity expresses interest... [figure: entity distribution for experimental runs] ...Multiple Users and Shared Applications with VRML. VRML 97, Monterey, CA. pp. 33-40. Calvin, J. O., D. P. Cebula, et al. (1995). Data Subscription in...

  4. Building Communities of Engineers to Share Technical Expertise

    NASA Technical Reports Server (NTRS)

    Topousis, Daria E.; Dennehy, Cornelius J.; Fesq, Lorraine M.

    2012-01-01

    This strategy was developed by the core community to describe our vision of an approach that ensures a sufficiently technically advanced and affordable AR&D technology base is available to support future NASA missions. The goal of this strategy is to create an environment exploiting reusable technology elements for an AR&D system design and development process which is: a) Lower-Risk. b) More Versatile/Scalable. c) Reliable & Crew-Safe. d) More Affordable.

  5. Social Transmission of False Memory in Small Groups and Large Networks.

    PubMed

    Maswood, Raeya; Rajaram, Suparna

    2018-05-21

    Sharing information and memories is a key feature of social interactions, making social contexts important for developing and transmitting accurate memories and also false memories. False memory transmission can have wide-ranging effects, including shaping personal memories of individuals as well as collective memories of a network of people. This paper reviews a collection of key findings and explanations in cognitive research on the transmission of false memories in small groups. It also reviews the emerging experimental work on larger networks and collective false memories. Given the reconstructive nature of memory, the abundance of misinformation in everyday life, and the variety of social structures in which people interact, an understanding of transmission of false memories has both scientific and societal implications. © 2018 Cognitive Science Society, Inc.

  6. Use Computer-Aided Tools to Parallelize Large CFD Applications

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Yan, J.

    2000-01-01

    Porting applications to high performance parallel computers is always a challenging task. It is time consuming and costly. With rapid progress in hardware architectures and the increasing complexity of real applications in recent years, the problem has become even more severe. Today, scalability and high performance mostly involve handwritten parallel programs using message-passing libraries (e.g. MPI). However, this process is very difficult and often error-prone. The recent reemergence of shared memory parallel (SMP) architectures, such as the cache coherent Non-Uniform Memory Access (ccNUMA) architecture used in the SGI Origin 2000, shows good prospects for scaling beyond hundreds of processors. Programming on an SMP is simplified by working in a globally accessible address space. The user can supply compiler directives, such as OpenMP, to parallelize the code. As an industry standard for portable implementation of parallel programs for SMPs, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. It promises an incremental path for parallel conversion of existing software, as well as scalability and performance for a complete rewrite or an entirely new development. Perhaps the main disadvantage of programming with directives is that inserted directives may not necessarily enhance performance. In the worst cases, they can create erroneous results. While vendors have provided tools to perform error-checking and profiling, automation in directive insertion is very limited and often fails on large programs, primarily due to the lack of a thorough enough data dependence analysis. To overcome this deficiency, we have developed a toolkit, CAPO, to automatically insert OpenMP directives in Fortran programs and apply certain degrees of optimization. CAPO takes advantage of the detailed inter-procedural dependence analysis provided by CAPTools, developed by the University of Greenwich, to reduce potential errors made by users. Earlier tests on the NAS Benchmarks and ARC3D demonstrated good success of this tool. In this study, we have applied CAPO to parallelize three large applications in the area of computational fluid dynamics (CFD): OVERFLOW, TLNS3D and INS3D. These codes are widely used for solving Navier-Stokes equations with complicated boundary conditions and turbulence models in multiple zones. Each comprises from 50K to 100K lines of FORTRAN77. As an example, CAPO took 77 hours to complete the data dependence analysis of OVERFLOW on a workstation (SGI, 175 MHz, R10K processor). A fair amount of effort was spent on correcting false dependences arising from a lack of necessary knowledge during the analysis. Even so, CAPO provides an easy way for the user to interact with the parallelization process. The OpenMP version was generated within a day after the analysis was completed. Due to the sequential algorithms involved, code sections in TLNS3D and INS3D needed to be restructured by hand to produce more efficient parallel codes. An included figure shows preliminary test results of the generated OVERFLOW with several test cases in a single zone. The MPI data points for the small test case were taken from a hand-coded MPI version. As we can see, CAPO's version achieved an 18-fold speedup on 32 nodes of the SGI O2K. For the small test case, it outperformed the MPI version. These results are very encouraging, but further work is needed.
For example, although CAPO attempts to place directives on the outermost parallel loops in an interprocedural framework, it does not insert directives based on the best manual strategy. In particular, it lacks support for parallelization at the multi-zone level. Future work will emphasize the development of a methodology to work at the multi-zone level and with a hybrid approach. Development of tools to perform more complicated code transformations is also needed.

  7. A Cloud-based Infrastructure and Architecture for Environmental System Research

    NASA Astrophysics Data System (ADS)

    Wang, D.; Wei, Y.; Shankar, M.; Quigley, J.; Wilson, B. E.

    2016-12-01

    The present availability of high-capacity networks, low-cost computers and storage devices, and the widespread adoption of hardware virtualization and service-oriented architecture provide a great opportunity to enable data and computing infrastructure sharing between closely related research activities. By taking advantage of these approaches, along with the world-class high-performance computing and data infrastructure located at Oak Ridge National Laboratory, a cloud-based infrastructure and architecture has been developed to efficiently deliver essential data and informatics services and utilities to the environmental system research community, and will provide unique capabilities that allow terrestrial ecosystem research projects to share their software utilities (tools), data and even data submission workflows in a straightforward fashion. The infrastructure will minimize large disruptions to current project-based data submission workflows for better acceptance by existing projects, since many ecosystem research projects already have their own requirements or preferences for data submission and collection. The infrastructure will eliminate the scalability problems of current project silos by providing unified data services and infrastructure. The infrastructure consists of two key components: (1) a collection of configurable virtual computing environments and user management systems that expedite data submission and collection from the environmental system research community, and (2) scalable data management services and systems, originated and developed by ORNL data centers.

  8. HTMT-class Latency Tolerant Parallel Architecture for Petaflops Scale Computation

    NASA Technical Reports Server (NTRS)

    Sterling, Thomas; Bergman, Larry

    2000-01-01

    Computational Aero Sciences and other numerically intensive computation disciplines demand computing throughputs substantially greater than the Teraflops scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that, in combination with sufficient resolution and advanced adaptive techniques, may force performance requirements towards Petaflops. This will be especially true for compute intensive models such as Navier-Stokes, or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithm techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA-led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops scale computing in the 2004/5 timeframe. The Hybrid-Technology, MultiThreaded parallel computer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) at one percent of the power required by conventional semiconductor logic. Wave Division Multiplexing optical communications can approach a peak per-fiber bandwidth of 1 Tbps, and the new Data Vortex network topology employing this technology can connect tens of thousands of ports, providing a bi-section bandwidth on the order of a Petabyte per second with latencies well below 100 nanoseconds, even under heavy loads. Processor-in-Memory (PIM) technology combines logic and memory on the same chip, exposing the internal bandwidth of the memory row buffers at low latency. And holographic photorefractive storage technologies provide high-density memory with access a thousand times faster than conventional disk technologies. Together these technologies enable a new class of shared memory system architecture with a peak performance in the range of a Petaflops but size and power requirements comparable to today's largest Teraflops scale systems. To achieve high sustained performance, HTMT combines an advanced multithreading processor architecture with a memory-driven coarse-grained latency management strategy called "percolation", yielding high efficiency while reducing much of the parallel programming burden. This paper will present the basic system architecture characteristics made possible through this series of advanced technologies and then give a detailed description of the new percolation approach to runtime latency management.

  9. MULTI: a shared memory approach to cooperative molecular modeling.

    PubMed

    Darden, T; Johnson, P; Smith, H

    1991-03-01

    A general purpose molecular modeling system, MULTI, based on the UNIX shared memory and semaphore facilities for interprocess communication is described. In addition to the normal querying or monitoring of geometric data, MULTI also provides processes for manipulating conformations, and for displaying peptide or nucleic acid ribbons, Connolly surfaces, close nonbonded contacts, crystal-symmetry related images, least-squares superpositions, and so forth. This paper outlines the basic techniques used in MULTI to ensure cooperation among these specialized processes, and then describes how they can work together to provide a flexible modeling environment.
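
    The coordination style, a shared segment plus a semaphore, maps directly onto today's Python equivalents of the UNIX facilities MULTI used. The sketch below is our own minimal illustration, not MULTI's code: four processes bump a counter in a shared segment, serialized by a semaphore.

    ```python
    from multiprocessing import Process, Semaphore, shared_memory

    def worker(name, sem):
        shm = shared_memory.SharedMemory(name=name)
        with sem:                        # semaphore serializes access
            shm.buf[0] += 1              # e.g. bump a shared counter
        shm.close()

    if __name__ == "__main__":
        shm = shared_memory.SharedMemory(create=True, size=16)
        sem = Semaphore(1)
        procs = [Process(target=worker, args=(shm.name, sem)) for _ in range(4)]
        for p in procs: p.start()
        for p in procs: p.join()
        print(shm.buf[0])                # 4
        shm.close(); shm.unlink()
    ```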

  10. SMT-Aware Instantaneous Footprint Optimization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roy, Probir; Liu, Xu; Song, Shuaiwen

    Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the whole memory hierarchy of a physical core. Without careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging, because they usually spawn threads within Single Program Multiple Data (SPMD) models. To address this important issue, we introduce a simple scheme for SMT-aware code optimization, which aims to reduce the memory contention across SMT threads.

  11. Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Caubet, Jordi; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    In this paper we describe how to apply powerful performance analysis techniques to understand the behavior of multilevel parallel applications. We use the Paraver/OMPItrace performance analysis system for our study. This system consists of two major components: the OMPItrace dynamic instrumentation mechanism, which allows the tracing of processes and threads, and the Paraver graphical user interface for inspection and analysis of the generated traces. We describe how to use the system to conduct a detailed comparative study of a benchmark code implemented in five different programming paradigms applicable for shared memory architectures.

  12. Cache-based error recovery for shared memory multiprocessor systems

    NASA Technical Reports Server (NTRS)

    Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.

    1989-01-01

    A multiprocessor cache-based checkpointing and recovery scheme for recovering from transient processor errors in a shared-memory multiprocessor with private caches is presented. New implementation techniques that use checkpoint identifiers and recovery stacks to reduce performance degradation in processor utilization during normal execution are examined. This cache-based checkpointing technique prevents rollback propagation, provides for rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions that take error latency into account are presented.

  13. Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Villa, Oreste; Fatica, Massimiliano; Gawande, Nitin A.

    In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphics Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different levels of parallelization and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp-level (one matrix, one Warp) parallelism and shared memory. The second works at Thread-block-level parallelism (one matrix, one Thread-block), still exploiting shared memory but managing matrices up to 76x76. The third is Thread-level parallel (one matrix, one thread) and can reach sizes up to 128x128, but it does not exploit shared memory and relies only on the high memory bandwidth of the GPU. The first and second solutions only support partial pivoting; the third easily supports partial and full pivoting, making it attractive for problems that require greater numerical stability. We analyze the trade-offs in terms of performance and power consumption as a function of the size of the linear systems that are simultaneously solved. We execute the three implementations on a Tesla M2090 (Fermi) and on a Tesla K20 (Kepler).
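
    The batching idea, many independent small LU-based solves issued at once, can be previewed in serial form with numpy's stacked solve, which runs one LU-based solve per matrix in the batch. Sizes below are illustrative; the paper's contribution is how such a batch maps onto warps, thread blocks, or single threads on the GPU.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    batch, n = 10000, 32
    A = rng.standard_normal((batch, n, n)) + n * np.eye(n)  # well-conditioned
    b = rng.standard_normal((batch, n, 1))

    x = np.linalg.solve(A, b)           # one LU solve per matrix in the batch
    print(np.abs(A @ x - b).max())      # residual stays small
    ```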

  14. A neuropsychological comparison of obsessive-compulsive disorder and trichotillomania.

    PubMed

    Chamberlain, Samuel R; Fineberg, Naomi A; Blackwell, Andrew D; Clark, Luke; Robbins, Trevor W; Sahakian, Barbara J

    2007-03-02

    Obsessive-compulsive disorder (OCD) and trichotillomania (compulsive hair-pulling) share overlapping co-morbidity, familial transmission, and phenomenology. However, the extent to which these disorders share a common cognitive phenotype has yet to be elucidated using patients without confounding co-morbidities. Our aim was to compare neurocognitive functioning in co-morbidity-free patients with OCD and trichotillomania, focusing on the domains of learning and memory, executive function, affective processing, reflection-impulsivity and decision-making. Twenty patients with OCD, 20 patients with trichotillomania, and 20 matched controls undertook neuropsychological assessment after meeting stringent inclusion criteria. Groups were matched for age, education, verbal IQ, and gender. The OCD and trichotillomania groups were impaired on spatial working memory. Only OCD patients showed additional impairments on executive planning and visual pattern recognition memory, and missed more responses to sad target words than the other groups on an affective go/no-go task. Furthermore, OCD patients failed to modulate their behaviour between conditions on the reflection-impulsivity test, suggestive of cognitive inflexibility. Both clinical groups showed intact decision-making and probabilistic reversal learning. OCD and trichotillomania shared overlapping spatial working memory problems, but neuropsychological dysfunction in OCD spanned additional domains that were intact in trichotillomania. Findings are discussed in relation to likely fronto-striatal neural substrates and future research directions.

  15. A Scalable Multicore Architecture With Heterogeneous Memory Structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs).

    PubMed

    Moradi, Saber; Qiao, Ning; Stefanini, Fabio; Indiveri, Giacomo

    2018-02-01

    Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here, we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multicore neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.

  16. Distributed memory parallel Markov random fields using graph partitioning

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Heinemann, C.; Perciano, T.; Ushizima, D.

    Markov random field (MRF) based algorithms have attracted a large amount of interest in image analysis due to their ability to exploit contextual information about data. Image data generated by experimental facilities, though, continues to grow larger and more complex, making it more difficult to analyze in a reasonable amount of time. Applying image processing algorithms to large datasets requires alternative approaches to circumvent performance problems. Aiming to provide scientists with a new tool to recover valuable information from such datasets, we developed a general purpose distributed memory parallel MRF-based image analysis framework (MPI-PMRF). MPI-PMRF overcomes performance and memory limitations by distributing data and computations across processors. The proposed approach was successfully tested with synthetic and experimental datasets. Additionally, the performance of the MPI-PMRF framework is analyzed through a detailed scalability study. We show that a performance increase is obtained while maintaining an accuracy of the segmentation results higher than 98%. The contributions of this paper are: (a) development of a distributed memory MRF framework; (b) measurement of the performance increase of the proposed approach; (c) verification of segmentation accuracy in both synthetic and experimental, real-world datasets.
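
    The distribution pattern, split the image, solve locally, reassemble, can be sketched with mpi4py. The local "MRF update" below is a placeholder smoothing step and the tile layout is a simple row-band split; neither is MPI-PMRF's actual graph-partitioned algorithm.

    ```python
    # Run with: mpiexec -n 4 python mrf_tiles.py   (hypothetical file name)
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    tiles = None
    if rank == 0:
        image = np.random.rand(size * 64, 64)
        tiles = np.split(image, size)          # one row band per rank

    tile = comm.scatter(tiles, root=0)         # distribute the data
    tile = 0.5 * tile + 0.5 * tile.mean()      # placeholder local MRF update
    result = comm.gather(tile, root=0)         # reassemble on rank 0

    if rank == 0:
        print(np.vstack(result).shape)
    ```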

  17. Quantum teleportation between remote atomic-ensemble quantum memories.

    PubMed

    Bao, Xiao-Hui; Xu, Xiao-Fan; Li, Che-Ming; Yuan, Zhen-Sheng; Lu, Chao-Yang; Pan, Jian-Wei

    2012-12-11

    Quantum teleportation and quantum memory are two crucial elements for large-scale quantum networks. With the help of prior distributed entanglement as a "quantum channel," quantum teleportation provides an intriguing means to faithfully transfer quantum states among distant locations without actual transmission of the physical carriers [Bennett CH, et al. (1993) Phys Rev Lett 70(13):1895-1899]. Quantum memory enables controlled storage and retrieval of fast-flying photonic quantum bits with stationary matter systems, which is essential to achieve the scalability required for large-scale quantum networks. Combining these two capabilities, here we realize quantum teleportation between two remote atomic-ensemble quantum memory nodes, each composed of ∼10^8 rubidium atoms and connected by a 150-m optical fiber. The spin wave state of one atomic ensemble is mapped to a propagating photon and subjected to Bell state measurements with another single photon that is entangled with the spin wave state of the other ensemble. Two-photon detection events herald the success of teleportation with an average fidelity of 88(7)%. Besides its fundamental interest as a teleportation between two remote macroscopic objects, our technique may be useful for quantum information transfer between different nodes in quantum networks and distributed quantum computing.

  18. On the Suitability of MPI as a PGAS Runtime

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daily, Jeffrey A.; Vishnu, Abhinav; Palmer, Bruce J.

    2014-12-18

    Partitioned Global Address Space (PGAS) models are emerging as a popular alternative to MPI models for designing scalable applications. At the same time, MPI remains a ubiquitous communication subsystem due to its standardization, high performance, and availability on leading platforms. In this paper, we explore the suitability of using MPI as a scalable PGAS communication subsystem. We focus on the Remote Memory Access (RMA) communication in PGAS models, which typically includes get, put, and atomic memory operations. We perform an in-depth exploration of design alternatives based on MPI. These alternatives include using a semantically-matching interface such as MPI-RMA, as well as not-so-intuitive interfaces such as MPI two-sided with a combination of multi-threading and dynamic process management. With an in-depth exploration of these alternatives and their shortcomings, we propose a novel design which is facilitated by the data-centric view in PGAS models. This design leverages a combination of highly tuned MPI two-sided semantics and an automatic, user-transparent split of MPI communicators to provide asynchronous progress. We implement the asynchronous progress ranks approach and other approaches within the Communication Runtime for Exascale, which is a communication subsystem for Global Arrays. Our performance evaluation spans pure communication benchmarks, graph community detection and sparse matrix-vector multiplication kernels, and a computational chemistry application. The utility of our proposed PR-based approach is demonstrated by a 2.17x speed-up on 1008 processors over the other MPI-based designs.
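
    The RMA operations under discussion look as follows through mpi4py's one-sided interface: each rank exposes a window of memory, and a peer puts into it without the target's participation. This is a generic MPI-RMA sketch under our own naming, not code from the paper's runtime.

    ```python
    # Run with: mpiexec -n 2 python rma_sketch.py   (hypothetical file name)
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    win = MPI.Win.Allocate(8, comm=comm)               # 8-byte window per rank
    local = np.frombuffer(win.tomemory(), dtype="i8")
    local[0] = rank

    if rank == 0:
        value = np.array([42], dtype="i8")
        win.Lock(1)                                    # passive-target epoch
        win.Put(value, target_rank=1)                  # one-sided put
        win.Unlock(1)

    comm.Barrier()
    if rank == 1:
        print(local[0])                                # 42
    win.Free()
    ```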

  19. A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations

    PubMed Central

    Ho, ThienLuan; Oh, Seung-Rohk

    2017-01-01

    Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using warp-shuffle operations instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results for real DNA packages revealed that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over the sequential algorithm on CPU and a previous parallel approximate string matching algorithm on GPUs, respectively. PMID:29016700
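
    The recurrence being parallelized is the standard k-differences dynamic program, where row 0 is all zeros so a match may start anywhere in the text. A serial baseline is sketched below; the paper's contribution, evaluating this recurrence across GPU threads with warp shuffles, is not shown here.

    ```python
    def k_difference_ends(pattern, text, k):
        """Text positions where an occurrence of pattern ends with <= k edits."""
        m, n = len(pattern), len(text)
        prev = [0] * (n + 1)                      # row 0: match starts anywhere
        for i in range(1, m + 1):
            cur = [i] + [0] * n                   # column 0: i deletions
            for j in range(1, n + 1):
                cost = pattern[i - 1] != text[j - 1]
                cur[j] = min(prev[j] + 1,         # deletion
                             cur[j - 1] + 1,      # insertion
                             prev[j - 1] + cost)  # match / substitution
            prev = cur
        return [j for j in range(1, n + 1) if prev[j] <= k]

    print(k_difference_ends("ACGT", "TTACGTTAGGT", k=1))
    ```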

  20. Scalable focused ion beam creation of nearly lifetime-limited single quantum emitters in diamond nanostructures

    PubMed Central

    Schröder, Tim; Trusheim, Matthew E.; Walsh, Michael; Li, Luozhou; Zheng, Jiabao; Schukraft, Marco; Sipahigil, Alp; Evans, Ruffin E.; Sukachev, Denis D.; Nguyen, Christian T.; Pacheco, Jose L.; Camacho, Ryan M.; Bielejec, Edward S.; Lukin, Mikhail D.; Englund, Dirk

    2017-01-01

    The controlled creation of defect centre—nanocavity systems is one of the outstanding challenges for efficiently interfacing spin quantum memories with photons for photon-based entanglement operations in a quantum network. Here we demonstrate direct, maskless creation of atom-like single silicon vacancy (SiV) centres in diamond nanostructures via focused ion beam implantation with ∼32 nm lateral precision and <50 nm positioning accuracy relative to a nanocavity. We determine the Si+ ion to SiV centre conversion yield to be ∼2.5% and observe a 10-fold conversion yield increase by additional electron irradiation. Low-temperature spectroscopy reveals inhomogeneously broadened ensemble emission linewidths of ∼51 GHz and close to lifetime-limited single-emitter transition linewidths down to 126±13 MHz corresponding to ∼1.4 times the natural linewidth. This method for the targeted generation of nearly transform-limited quantum emitters should facilitate the development of scalable solid-state quantum information processors. PMID:28548097

  1. Scalable focused ion beam creation of nearly lifetime-limited single quantum emitters in diamond nanostructures

    DOE PAGES

    Schroder, Tim; Trusheim, Matthew E.; Walsh, Michael; ...

    2017-05-26

    The controlled creation of defect centre—nanocavity systems is one of the outstanding challenges for efficiently interfacing spin quantum memories with photons for photon-based entanglement operations in a quantum network. Here we demonstrate direct, maskless creation of atom-like single silicon vacancy (SiV) centres in diamond nanostructures via focused ion beam implantation with ~32 nm lateral precision and <50 nm positioning accuracy relative to a nanocavity. We determine the Si+ ion to SiV centre conversion yield to be ~2.5% and observe a 10-fold conversion yield increase by additional electron irradiation. Low-temperature spectroscopy reveals inhomogeneously broadened ensemble emission linewidths of ~51 GHz and close to lifetime-limited single-emitter transition linewidths down to 126±13 MHz, corresponding to ~1.4 times the natural linewidth. Furthermore, this method for the targeted generation of nearly transform-limited quantum emitters should facilitate the development of scalable solid-state quantum information processors.

  2. Scalable focused ion beam creation of nearly lifetime-limited single quantum emitters in diamond nanostructures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schroder, Tim; Trusheim, Matthew E.; Walsh, Michael

    The controlled creation of defect centre—nanocavity systems is one of the outstanding challenges for efficiently interfacing spin quantum memories with photons for photon-based entanglement operations in a quantum network. Here we demonstrate direct, maskless creation of atom-like single silicon vacancy (SiV) centres in diamond nanostructures via focused ion beam implantation with ~32 nm lateral precision and <50 nm positioning accuracy relative to a nanocavity. We determine the Si+ ion to SiV centre conversion yield to be ~2.5% and observe a 10-fold conversion yield increase by additional electron irradiation. Low-temperature spectroscopy reveals inhomogeneously broadened ensemble emission linewidths of ~51 GHz andmore » close to lifetime-limited single-emitter transition linewidths down to 126±13 MHz corresponding to ~1.4 times the natural linewidth. Furthermore, this method for the targeted generation of nearly transform-limited quantum emitters should facilitate the development of scalable solid-state quantum information processors.« less

  3. A High-Speed Design of Montgomery Multiplier

    NASA Astrophysics Data System (ADS)

    Fan, Yibo; Ikenaga, Takeshi; Goto, Satoshi

    With the increase of the key lengths used in public-key cryptographic algorithms such as RSA and ECC, the speed of Montgomery multiplication has become a bottleneck. This paper proposes a high-speed design of a Montgomery multiplier. Firstly, a modified scalable high-radix Montgomery algorithm is proposed to reduce the critical path. Secondly, a high-radix clock-saving dataflow is proposed to support high-radix operation and one clock cycle of delay in the dataflow. Finally, a hardware-reused architecture is proposed to reduce hardware cost, and a parallel radix-16 datapath design is proposed to increase speed. Implementation results using the HHNEC 0.25 μm standard cell library show that the total cost of the Montgomery multiplier is 130 KGates, the clock frequency is 180 MHz, and the throughput of 1024-bit RSA encryption is 352 kbps. This design is suitable for high-speed RSA or ECC encryption/decryption. As a scalable design, it supports encryption/decryption of any key length up to the size of the on-chip memory.
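
    As background for the reduction step the hardware accelerates, the following is a minimal word-level Montgomery multiplication in C++ with R = 2^64 (a software sketch of the textbook REDC step, not the paper's radix-16 hardware; it relies on the GCC/Clang unsigned __int128 extension):

      #include <cstdint>
      #include <iostream>

      using u64 = uint64_t;
      using u128 = unsigned __int128;  // GCC/Clang extension

      // Returns -n^{-1} mod 2^64 for odd n, via Newton iteration
      // (each step doubles the number of correct low-order bits).
      static u64 neg_inv64(u64 n) {
          u64 x = n;  // n is its own inverse mod 8 when n is odd
          for (int i = 0; i < 5; ++i) x *= 2 - n * x;
          return ~x + 1;
      }

      // Montgomery product a*b*R^{-1} mod n (n odd, a,b < n, R = 2^64).
      static u64 montmul(u64 a, u64 b, u64 n, u64 nprime) {
          u128 t = (u128)a * b;
          u64 m = (u64)t * nprime;                 // (t mod R) * (-n^{-1}) mod R
          u64 r = (u64)((t + (u128)m * n) >> 64);  // exact: low 64 bits cancel
          return r >= n ? r - n : r;
      }

      int main() {
          u64 n = 0xFFFFFFFFFFFFFFC5ULL;           // 2^64 - 59, an odd prime
          u64 np = neg_inv64(n);
          u64 a = 123456789, b = 987654321;
          u64 aR = (u64)(((u128)a << 64) % n);     // into Montgomery form
          u64 bR = (u64)(((u128)b << 64) % n);
          u64 ab = montmul(montmul(aR, bR, n, np), 1, n, np);  // back out
          std::cout << ab << " == " << (u64)(((u128)a * b) % n) << '\n';
      }

    A radix-2^w datapath performs the same multiply-and-reduce step w bits at a time; the paper's radix-16 design follows that idea in hardware to shorten the critical path.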

  4. Scalable service architecture for providing strong service guarantees

    NASA Astrophysics Data System (ADS)

    Christin, Nicolas; Liebeherr, Joerg

    2002-07-01

    For the past decade, a great deal of Internet research has been devoted to providing different levels of service to applications. Initial proposals for service differentiation provided strong service guarantees, with strict bounds on delays, loss rates, and throughput, but required high overhead in terms of computational complexity and memory, both of which raise scalability concerns. Recently, interest has shifted to service architectures with low overhead. However, these newer service architectures provide only weak service guarantees, which do not always address the needs of applications. In this paper, we describe a service architecture that supports strong service guarantees, can be implemented with low computational complexity, and requires maintaining only a small amount of state information. A key feature of the proposed service architecture is that it addresses scheduling and buffer management in a single algorithm. The presented architecture offers no solution for controlling the amount of traffic that enters the network; instead, we plan on exploiting the feedback mechanisms of TCP congestion control for the purpose of regulating the traffic entering the network.

  5. Relative time sharing: new findings and an extension of the resource allocation model of temporal processing.

    PubMed

    Buhusi, Catalin V; Meck, Warren H

    2009-07-12

    Individuals time events as if using a stopwatch that can be stopped or reset on command. Here, we review behavioural and neurobiological data supporting the time-sharing hypothesis that perceived time depends on the attentional and memory resources allocated to the timing process. Neuroimaging studies in humans suggest that timekeeping tasks engage brain circuits typically involved in attention and working memory. Behavioural, pharmacological, lesion and electrophysiological studies in lower animals support this time-sharing hypothesis. When subjects attend to a second task, or when intruder events are presented, estimated durations are shorter, presumably because resources are taken away from timing. Here, we extend the time-sharing hypothesis by proposing that resource reallocation is proportional to the perceived contrast, in both temporal and non-temporal features, between intruders and the timed events. New findings support this extension by showing that the effect of an intruder event depends on the duration of the intruder relative to the intertrial interval. The conclusion is that the brain circuits engaged by timekeeping comprise not only those primarily involved in time accumulation, but also those involved in maintaining attentional and memory resources for timing, and in monitoring and reallocating those resources among tasks.

  6. Level-2 Milestone 5588: Deliver Strategic Plan and Initial Scalability Assessment by Advanced Architecture and Portability Specialists Team

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Draeger, Erik W.

    This report documents the fact that the work in creating a strategic plan and beginning customer engagements has been completed. The description of the milestone is: The newly formed advanced architecture and portability specialists (AAPS) team will develop a strategic plan to meet the goals of 1) sharing knowledge and experience with code teams to ensure that ASC codes run well on new architectures, and 2) supplying skilled computational scientists to put the strategy into practice. The plan will be delivered to ASC management in the first quarter. By the fourth quarter, the team will identify their first customers within PEM and IC, perform an initial assessment of scalability and performance bottlenecks for next-generation architectures, and embed AAPS team members with customer code teams to assist with initial portability development within standalone kernels or proxy applications.

  7. Elevated-Confined Phase-Change Random Access Memory Cells

    NASA Astrophysics Data System (ADS)

    Lee, Hock Koon; Shi, Luping; Zhao, Rong; Yang, Hongxin; Lim, Kian Guan; Li, Jianming; Chong, Tow Chong

    2010-04-01

    A new elevated-confined phase-change random access memory (PCRAM) cell structure that reduces power consumption is proposed. In this structure, the confined phase-change region sits on top of a small metal column enclosed by a dielectric at the sides, so more heat is effectively retained underneath the phase-change region. In the conventional structure, by contrast, the confined phase-change region sits directly above a large planar bottom metal electrode, which readily conducts most of the induced heat away. Simulations show a more uniform temperature profile around the active region and a higher peak temperature at the phase-change layer (PCL) in the elevated-confined structure. Experimental results showed that the elevated-confined PCRAM cell requires lower programming power and has better scalability than a conventional confined PCRAM cell.

  8. An experimental distributed microprocessor implementation with a shared memory communications and control medium

    NASA Technical Reports Server (NTRS)

    Mejzak, R. S.

    1980-01-01

    The distributed processing concept is defined in terms of control primitives, variables, and structures and their use in performing a decomposed discrete Fourier transform (DFT) application function. The design assumes interprocessor communications to be anonymous. In this scheme, all processors can access an entire common database by employing control primitives. Access to selected areas within the common database is random, enforced by a hardware lock, and determined by task and subtask pointers. This enables the number of processors in the configuration to be varied without any modification to the control structure. Decompositional elements of the DFT application function in terms of tasks and subtasks are also described. The experimental hardware configuration consists of IMSAI 8080 chassis, which are independent 8-bit microcomputer units. These chassis are linked together to form a multiprocessing system by means of a shared memory facility. This facility consists of hardware which provides a bus structure to enable up to six microcomputers to be interconnected. It provides polling and arbitration logic so that only one processor has access to shared memory at any one time.
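
    The same control structure, a single lock guarding shared task pointers so that the number of workers can vary freely, is straightforward to mimic in software. A thread-based C++ sketch (our illustration, not the IMSAI hardware):

      #include <cstdio>
      #include <mutex>
      #include <thread>
      #include <vector>

      // Software analogue of the shared-memory control medium: one lock plays
      // the role of the hardware lock, and a subtask pointer into the common
      // database tells each processor what to claim next. Changing the number
      // of processors requires no change to this control structure.
      struct SharedDatabase {
          std::mutex lock;
          size_t nextSubtask = 0;
          size_t totalSubtasks = 16;
      };

      void processor(SharedDatabase& db, int id) {
          for (;;) {
              size_t task;
              {
                  std::lock_guard<std::mutex> guard(db.lock);  // one owner at a time
                  if (db.nextSubtask >= db.totalSubtasks) return;
                  task = db.nextSubtask++;
              }
              // ... perform the DFT subtask here ...
              std::printf("processor %d ran subtask %zu\n", id, task);
          }
      }

      int main() {
          SharedDatabase db;
          std::vector<std::thread> cpus;
          for (int id = 0; id < 4; ++id)
              cpus.emplace_back(processor, std::ref(db), id);
          for (auto& c : cpus) c.join();
      }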

  9. Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework (AthenaMP)

    NASA Astrophysics Data System (ADS)

    Calafiura, Paolo; Leggett, Charles; Seuster, Rolf; Tsulaia, Vakhtang; Van Gemmeren, Peter

    2015-12-01

    AthenaMP is a multi-process version of the ATLAS reconstruction, simulation and data analysis framework Athena. By leveraging Linux fork and copy-on-write mechanisms, it allows for sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain configurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows the running of AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of AthenaMP in the diversity of ATLAS event processing workloads on various computing resources: Grid, opportunistic resources and HPC.
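
    Both mechanisms are easy to see in miniature: fork() gives each worker copy-on-write access to whatever the parent has already loaded, and a small shared-memory counter can serve as the event queue that hands out event indices. A toy POSIX/C++ model (our illustration, not ATLAS code):

      #include <sys/mman.h>
      #include <sys/wait.h>
      #include <unistd.h>
      #include <atomic>
      #include <cstdio>
      #include <new>
      #include <vector>

      int main() {
          const int numEvents = 20, numWorkers = 4;
          // "Framework" data loaded once; fork() shares these pages
          // copy-on-write, so they cost physical memory only once.
          std::vector<double> events(numEvents, 1.0);

          // Shared Event Queue in miniature: an atomic cursor in shared memory.
          void* mem = mmap(nullptr, sizeof(std::atomic<int>),
                           PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          auto* next = new (mem) std::atomic<int>(0);

          for (int w = 0; w < numWorkers; ++w) {
              if (fork() == 0) {                   // child: worker process
                  for (;;) {
                      int e = next->fetch_add(1);  // pull next event index
                      if (e >= numEvents) { std::fflush(stdout); _exit(0); }
                      std::printf("worker %d (pid %ld) processed event %d (%.1f)\n",
                                  w, (long)getpid(), e, events[e]);
                  }
              }
          }
          for (int w = 0; w < numWorkers; ++w) wait(nullptr);
          munmap(mem, sizeof(std::atomic<int>));
      }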

  10. Optimized Infrastructure for the Earth System Prediction Capability

    DTIC Science & Technology

    2013-09-30

    for referencing memory between its native coupling datatype (MCT Attribute Vectors) and ESMF Arrays. This will reduce the copies required and will... The introduced ability within CESM to share memory between ESMF and MCT datatypes makes using both tools together much easier. Using both is appealing

  11. Infectious Cognition: Risk Perception Affects Socially Shared Retrieval-Induced Forgetting of Medical Information.

    PubMed

    Coman, Alin; Berry, Jessica N

    2015-12-01

    When speakers selectively retrieve previously learned information, listeners often concurrently, and covertly, retrieve their memories of that information. This concurrent retrieval typically enhances memory for mentioned information (the rehearsal effect) and impairs memory for unmentioned but related information (socially shared retrieval-induced forgetting, SSRIF), relative to memory for unmentioned and unrelated information. Building on research showing that anxiety leads to increased attention to threat-relevant information, we explored whether concurrent retrieval is facilitated in high-anxiety real-world contexts. Participants first learned category-exemplar facts about meningococcal disease. Following a manipulation of perceived risk of infection (low vs. high risk), they listened to a mock radio show in which some of the facts were selectively practiced. Final recall tests showed that the rehearsal effect was equivalent between the two risk conditions, but SSRIF was significantly larger in the high-risk than in the low-risk condition. Thus, the tendency to exaggerate the consequences of news events was found to have deleterious effects on memory.

  12. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blocksome, Michael A.; Mamidala, Amith R.

    2013-09-03

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
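
    The claim worth noting is "no FENCE accounting": because the channel delivers deterministically (in order), a fence can simply be posted into the same ordered stream, and its completion implies completion of everything posted before it, with no per-transfer bookkeeping. A small single-process C++ model of that ordering argument (our sketch, not the PAMI API):

      #include <condition_variable>
      #include <functional>
      #include <iostream>
      #include <mutex>
      #include <queue>
      #include <thread>

      // An ordered channel: a consumer thread (standing in for the DMA
      // controller) drains posted operations strictly in order, so waiting
      // for a fence marker guarantees all earlier posts have completed.
      class OrderedChannel {
          std::queue<std::function<void()>> q;
          std::mutex m;
          std::condition_variable cv;
          bool done = false;
      public:
          void post(std::function<void()> op) {
              { std::lock_guard<std::mutex> g(m); q.push(std::move(op)); }
              cv.notify_all();
          }
          void fence() {  // blocks until every earlier post has been delivered
              std::mutex fm; std::condition_variable fcv; bool hit = false;
              post([&] { { std::lock_guard<std::mutex> g(fm); hit = true; } fcv.notify_one(); });
              std::unique_lock<std::mutex> g(fm);
              fcv.wait(g, [&] { return hit; });
          }
          void run() {
              for (;;) {
                  std::unique_lock<std::mutex> g(m);
                  cv.wait(g, [&] { return !q.empty() || done; });
                  if (q.empty()) return;
                  auto op = std::move(q.front()); q.pop();
                  g.unlock();
                  op();  // in-order delivery
              }
          }
          void stop() { { std::lock_guard<std::mutex> g(m); done = true; } cv.notify_all(); }
      };

      int main() {
          OrderedChannel ch;
          std::thread controller(&OrderedChannel::run, &ch);
          for (int i = 0; i < 3; ++i)
              ch.post([i] { std::cout << "transfer " << i << " delivered\n"; });
          ch.fence();  // returns only after transfers 0..2 are delivered
          std::cout << "fence complete\n";
          ch.stop();
          controller.join();
      }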

  13. Phase-change materials for non-volatile memory devices: from technological challenges to materials science issues

    NASA Astrophysics Data System (ADS)

    Noé, Pierre; Vallée, Christophe; Hippert, Françoise; Fillot, Frédéric; Raty, Jean-Yves

    2018-01-01

    Chalcogenide phase-change materials (PCMs), such as Ge-Sb-Te alloys, have shown outstanding properties, which has led to their successful use for a long time in optical memories (DVDs) and, recently, in non-volatile resistive memories. The latter, known as PCM memories or phase-change random access memories (PCRAMs), are the most promising candidates among emerging non-volatile memory (NVM) technologies to replace the current FLASH memories at CMOS technology nodes under 28 nm. Chalcogenide PCMs exhibit fast and reversible phase transformations between crystalline and amorphous states with very different transport and optical properties leading to a unique set of features for PCRAMs, such as fast programming, good cyclability, high scalability, multi-level storage capability, and good data retention. Nevertheless, PCM memory technology has to overcome several challenges to definitively invade the NVM market. In this review paper, we examine the main technological challenges that PCM memory technology must face and we illustrate how new memory architecture, innovative deposition methods, and PCM composition optimization can contribute to further improvements of this technology. In particular, we examine how to lower the programming currents and increase data retention. Scaling down PCM memories for large-scale integration means the incorporation of the PCM into more and more confined structures and raises materials science issues in order to understand interface and size effects on crystallization. Other materials science issues are related to the stability and ageing of the amorphous state of PCMs. The stability of the amorphous phase, which determines data retention in memory devices, can be increased by doping the PCM. Ageing of the amorphous phase leads to a large increase of the resistivity with time (resistance drift), which has up to now hindered the development of ultra-high multi-level storage devices. A review of the current understanding of all these issues is provided from a materials science point of view.

  14. Highly Scalable Asynchronous Computing Method for Partial Differential Equations: A Path Towards Exascale

    NASA Astrophysics Data System (ADS)

    Konduri, Aditya

    Many natural and engineering systems are governed by nonlinear partial differential equations (PDEs) which give rise to multiscale phenomena, e.g. turbulent flows. Numerical simulations of these problems are computationally very expensive and demand extreme levels of parallelism. At realistic conditions, simulations are carried out on massively parallel computers with hundreds of thousands of processing elements (PEs). It has been observed that communication between PEs, as well as their synchronization, at these extreme scales takes up a significant portion of the total simulation time and results in poor scalability of codes. This issue is likely to pose a bottleneck in the scalability of codes on future Exascale systems. In this work, we propose an asynchronous computing algorithm based on widely used finite difference methods to solve PDEs in which synchronization between PEs due to communication is relaxed at a mathematical level. We show that while stability is preserved when schemes are used asynchronously, accuracy is greatly degraded. Since message arrivals at PEs are random processes, so is the behavior of the error. We propose a new statistical framework in which we show that average errors always drop to first order regardless of the original scheme. We propose new asynchrony-tolerant schemes that maintain accuracy when synchronization is relaxed. The quality of the solution is shown to depend not only on the physical phenomena and numerical schemes, but also on the characteristics of the computing machine. A novel algorithm using remote memory access communications has been developed to demonstrate excellent scalability of the method for large-scale computing. Finally, we present a path to extending this method to solving complex multiscale problems on Exascale machines.
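
    The relaxation at the heart of the method can be illustrated on the 1-D heat equation: a block boundary may be updated with a halo value that is several steps old, as if the neighbouring PE's message were late. A toy serial C++ model with a randomly stale halo (our illustration, not the author's solver):

      #include <algorithm>
      #include <iostream>
      #include <random>
      #include <vector>

      // Explicit finite differences for u_t = alpha * u_xx, where the value
      // used across one "block boundary" may lag by up to maxDelay steps.
      // The update rule stays stable, but the stale data perturbs accuracy,
      // which is what asynchrony-tolerant schemes are designed to correct.
      int main() {
          const int n = 64, steps = 200, maxDelay = 3;
          const double alpha = 1.0, dx = 1.0 / n, dt = 0.2 * dx * dx / alpha;
          std::vector<std::vector<double>> hist;  // solution at every past step
          std::vector<double> u(n, 0.0);
          u[n / 2] = 1.0;                         // initial spike
          hist.push_back(u);

          std::mt19937 rng(42);
          std::uniform_int_distribution<int> delay(0, maxDelay);

          for (int s = 0; s < steps; ++s) {
              std::vector<double> v(n, 0.0);
              for (int i = 1; i + 1 < n; ++i) {
                  // Treat i == n/2 as the block boundary: its right-hand halo
                  // value is read from a randomly delayed earlier step.
                  const auto& right = (i == n / 2)
                      ? hist[std::max(0, (int)hist.size() - 1 - delay(rng))]
                      : hist.back();
                  double lap = u[i - 1] - 2.0 * u[i] + right[i + 1];
                  v[i] = u[i] + alpha * dt / (dx * dx) * lap;
              }
              u = v;
              hist.push_back(u);
          }
          std::cout << "centre value after " << steps << " steps: "
                    << u[n / 2] << '\n';
      }

    Comparing runs with maxDelay = 0 and maxDelay > 0 gives a feel for the extra error term that the paper's statistical framework characterizes.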

  15. The Work's Not Over- Roll Up Your Sleeves and Make a Difference!

    NASA Astrophysics Data System (ADS)

    Sarquis, Mickey

    1997-01-01

    As my 17-year tenure as the first editor of the Secondary School Chemistry Section draws to a close, John Moore has invited me to share some reflections on my experiences. It's hard for me to believe that this many years have passed; in some ways, it seems like only yesterday that I took on this position. Looking back over my term as Section editor recalls wonderful memories, but it also stimulates me to seek out and take on new challenges as I move into a new phase of involvement in chemical education. In response to John's kind invitation, I'd like to share some of these memories and ideas with you who share my vision of quality chemical education, particularly at the secondary level.

  16. Arousal-biased competition in perception and memory

    PubMed Central

    Mather, Mara; Sutherland, Matthew R.

    2010-01-01

    Our everyday surroundings besiege us with information. The battle is for a share of our limited attention and memory, with the brain selecting the winners and discarding the losers. Previous research shows that both bottom-up and top-down factors bias competition in favor of high priority stimuli. We propose that arousal during an event increases this bias both in perception and in long-term memory of the event. Arousal-biased competition theory provides specific predictions about when arousal will enhance and when it will impair memory for events, accounting for some puzzling contradictions in the emotional memory literature. PMID:21660127

  17. Glucocorticoids in the prefrontal cortex enhance memory consolidation and impair working memory by a common neural mechanism

    PubMed Central

    Barsegyan, Areg; Mackenzie, Scott M.; Kurose, Brian D.; McGaugh, James L.; Roozendaal, Benno

    2010-01-01

    It is well established that acute administration of adrenocortical hormones enhances the consolidation of memories of emotional experiences and, concurrently, impairs working memory. These different glucocorticoid effects on these two memory functions have generally been considered to be independently regulated processes. Here we report that a glucocorticoid receptor agonist administered into the medial prefrontal cortex (mPFC) of male Sprague-Dawley rats both enhances memory consolidation and impairs working memory. Both memory effects are mediated by activation of a membrane-bound steroid receptor and depend on noradrenergic activity within the mPFC to increase levels of cAMP-dependent protein kinase. These findings provide direct evidence that glucocorticoid effects on both memory consolidation and working memory share a common neural influence within the mPFC. PMID:20810923

  18. Benefits and Costs of Context Reinstatement in Episodic Memory: An ERP Study.

    PubMed

    Bramão, Inês; Johansson, Mikael

    2017-01-01

    This study investigated context-dependent episodic memory retrieval. An influential idea in the memory literature is that performance benefits when the retrieval context overlaps with the original encoding context. However, such memory facilitation may not be driven by the encoding-retrieval overlap per se but by the presence of diagnostic features in the reinstated context that discriminate the target episode from competing episodes. To test this prediction, the encoding-retrieval overlap and the diagnostic value of the context were manipulated in a novel associative recognition memory task. Participants were asked to memorize word pairs presented together with diagnostic (unique) and nondiagnostic (shared) background scenes. At test, participants recognized the word pairs in the presence and absence of the previously encoded contexts. Behavioral data show facilitated memory performance in the presence of the original context but, importantly, only when the context was diagnostic of the target episode. The electrophysiological data reveal an early anterior ERP encoding-retrieval overlap effect that tracks the cost associated with having nondiagnostic contexts present at retrieval, that is, shared by multiple previous episodes, and a later posterior encoding-retrieval overlap effect that reflects facilitated access to the target episode during retrieval in diagnostic contexts. Taken together, our results underscore the importance of the diagnostic value of the context and suggest that context-dependent episodic memory effects are multiply determined.

  19. Computational scalability of large size image dissemination

    NASA Astrophysics Data System (ADS)

    Kooper, Rob; Bajcsy, Peter

    2011-01-01

    We have investigated the computational scalability of the image pyramid building needed for dissemination of very large image data. The sources of large images include high resolution microscopes and telescopes, remote sensing and airborne imaging, and high resolution scanners. The term 'large' is understood from a user perspective: an image larger than the display, or larger than the memory/disk available to hold the image data. The application drivers for our work are digitization projects such as the Lincoln Papers project (each image scan is about 100-150 MB, or about 5000x8000 pixels, with around 200,000 scans in total) and the UIUC library project scanning historical maps from the 17th and 18th centuries (fewer but larger images). The goal of our work is to understand the computational scalability of web-based dissemination using image pyramids for these large image scans, as well as the preservation aspects of the data. We report our computational benchmarks for (a) building image pyramids to be disseminated using the Microsoft Seadragon library, (b) a computation execution approach using hyper-threading to generate image pyramids and to utilize the underlying hardware, and (c) an image pyramid preservation approach using various hard drive configurations of Redundant Array of Independent Disks (RAID) drives for input/output operations. The benchmarks are obtained with a map (334.61 MB, JPEG format, 17591x15014 pixels). The discussion combines the speed and preservation objectives.
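
    Pyramid building itself is repeated halving until the image reaches a single pixel: with ceiling division the 17591x15014 map above halves 15 times, giving 16 levels in total. A grayscale C++ sketch with 2x2 box-filter downsampling (illustrative; the actual Seadragon tooling also cuts each level into tiles):

      #include <algorithm>
      #include <cstdint>
      #include <iostream>
      #include <vector>

      struct Level { int w, h; std::vector<uint8_t> px; };  // row-major grayscale

      // Build all pyramid levels by repeated 2x downsampling (2x2 box filter,
      // ceiling division for odd dimensions, as in Deep Zoom level counting).
      std::vector<Level> buildPyramid(Level base) {
          std::vector<Level> pyr;
          pyr.push_back(std::move(base));
          while (pyr.back().w > 1 || pyr.back().h > 1) {
              const Level& src = pyr.back();
              Level dst{(src.w + 1) / 2, (src.h + 1) / 2, {}};
              dst.px.resize((size_t)dst.w * dst.h);
              for (int y = 0; y < dst.h; ++y)
                  for (int x = 0; x < dst.w; ++x) {
                      int x0 = 2 * x, y0 = 2 * y;
                      int x1 = std::min(x0 + 1, src.w - 1);
                      int y1 = std::min(y0 + 1, src.h - 1);
                      int sum = src.px[(size_t)y0 * src.w + x0]
                              + src.px[(size_t)y0 * src.w + x1]
                              + src.px[(size_t)y1 * src.w + x0]
                              + src.px[(size_t)y1 * src.w + x1];
                      dst.px[(size_t)y * dst.w + x] = (uint8_t)(sum / 4);
                  }
              pyr.push_back(std::move(dst));
          }
          return pyr;
      }

      int main() {
          Level base{17591, 15014,
                     std::vector<uint8_t>((size_t)17591 * 15014, 128)};
          std::cout << buildPyramid(std::move(base)).size()
                    << " levels\n";  // 16: the base plus 15 halvings
      }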

  20. An O(N) and parallel approach to integral problems by a kernel-independent fast multipole method: Application to polarization and magnetization of interacting particles

    NASA Astrophysics Data System (ADS)

    Jiang, Xikai; Li, Jiyuan; Zhao, Xujun; Qin, Jian; Karpeev, Dmitry; Hernandez-Ortiz, Juan; de Pablo, Juan J.; Heinonen, Olle

    2016-08-01

    Large classes of materials systems in physics and engineering are governed by magnetic and electrostatic interactions. Continuum or mesoscale descriptions of such systems can be cast in terms of integral equations, whose direct computational evaluation requires O(N²) operations, where N is the number of unknowns. Such a scaling, which arises from the many-body nature of the relevant Green's function, has precluded widespread adoption of integral methods for the solution of large-scale scientific and engineering problems. In this work, a parallel computational approach is presented that relies on scalable open source libraries and utilizes a kernel-independent Fast Multipole Method (FMM) to evaluate the integrals in O(N) operations, with O(N) memory cost, thereby substantially improving the scalability and efficiency of computational integral methods. We demonstrate the accuracy, efficiency, and scalability of our approach in the context of two examples. In the first, we solve a boundary value problem for a ferroelectric/ferromagnetic volume in free space. In the second, we solve an electrostatic problem involving polarizable dielectric bodies in an unbounded dielectric medium. The results from these test cases show that our proposed parallel approach, which is built on a kernel-independent FMM, can enable highly efficient and accurate simulations and allow for considerable flexibility in a broad range of applications.
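
    The scaling the paper addresses is visible in the direct evaluation: every unknown interacts with every other through the Green's function, so the plain pairwise sum costs O(N²), while the kernel-independent FMM returns the same sums to a prescribed accuracy in O(N) work and memory. The baseline, for contrast (simple C++, not the paper's library stack):

      #include <cmath>
      #include <iostream>
      #include <random>
      #include <vector>

      struct Vec3 { double x, y, z; };

      // Direct O(N^2) potentials phi_i = sum_{j != i} q_j / |r_i - r_j|.
      // This is the many-body sum a kernel-independent FMM evaluates in
      // O(N) operations with O(N) memory, to a prescribed accuracy.
      std::vector<double> directPotential(const std::vector<Vec3>& r,
                                          const std::vector<double>& q) {
          const size_t n = r.size();
          std::vector<double> phi(n, 0.0);
          for (size_t i = 0; i < n; ++i)
              for (size_t j = 0; j < n; ++j) {
                  if (i == j) continue;
                  double dx = r[i].x - r[j].x;
                  double dy = r[i].y - r[j].y;
                  double dz = r[i].z - r[j].z;
                  phi[i] += q[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
              }
          return phi;
      }

      int main() {
          std::mt19937 rng(1);
          std::uniform_real_distribution<double> uni(0.0, 1.0);
          const size_t n = 2000;                 // scale n to see the O(N^2) cost
          std::vector<Vec3> r(n);
          std::vector<double> q(n, 1.0);
          for (auto& p : r) p = {uni(rng), uni(rng), uni(rng)};
          std::cout << "phi[0] = " << directPotential(r, q)[0] << '\n';
      }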
