Static Memory Deduplication for Performance Optimization in Cloud Computing
Jia, Gangyong; Han, Guangjie; Wang, Hao; Yang, Xuan
2017-01-01
In a cloud computing environment, the number of virtual machines (VMs) on a single physical server and the number of applications running on each VM are continuously growing. This has led to an enormous increase in the demand of memory capacity and subsequent increase in the energy consumption in the cloud. Lack of enough memory has become a major bottleneck for scalability and performance of virtualization interfaces in cloud computing. To address this problem, memory deduplication techniques which reduce memory demand through page sharing are being adopted. However, such techniques suffer from overheads in terms of number of online comparisons required for the memory deduplication. In this paper, we propose a static memory deduplication (SMD) technique which can reduce memory capacity requirement and provide performance optimization in cloud computing. The main innovation of SMD is that the process of page detection is performed offline, thus potentially reducing the performance cost, especially in terms of response time. In SMD, page comparisons are restricted to the code segment, which has the highest shared content. Our experimental results show that SMD efficiently reduces memory capacity requirement and improves performance. We demonstrate that, compared to other approaches, the cost in terms of the response time is negligible. PMID:28448434
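As a rough illustration of the page-sharing idea in the SMD abstract above, the sketch below (a hypothetical Python toy, not the authors' implementation; the page size, hash choice, and input layout are assumptions) hashes only code-segment pages offline and records which pages across VMs are identical and could be merged:

    import hashlib
    from collections import defaultdict

    PAGE_SIZE = 4096  # bytes; a typical page size (assumption)

    def dedup_code_pages(vm_memory_images):
        """Toy offline pass: find identical code-segment pages across VMs.

        vm_memory_images: list of (vm_id, segment, bytes) triples, where
        segment is 'code' or 'data'.  Returns one canonical copy per
        distinct page content plus the list of (vm_id, page_index) sharers.
        """
        shared = defaultdict(list)   # content hash -> list of sharers
        canonical = {}               # content hash -> one stored copy
        for vm_id, segment, image in vm_memory_images:
            if segment != 'code':    # SMD restricts comparison to code segments
                continue
            for off in range(0, len(image), PAGE_SIZE):
                page = image[off:off + PAGE_SIZE]
                h = hashlib.sha1(page).hexdigest()
                canonical.setdefault(h, page)
                shared[h].append((vm_id, off // PAGE_SIZE))
        return canonical, shared

    # Example: two VMs that loaded the same binary share its code pages.
    text = b'\x90' * PAGE_SIZE * 3          # stand-in for identical code pages
    imgs = [('vm0', 'code', text), ('vm1', 'code', text), ('vm1', 'data', b'x' * PAGE_SIZE)]
    canon, sharers = dedup_code_pages(imgs)
    saved = sum(len(v) - 1 for v in sharers.values()) * PAGE_SIZE
    print(f"pages stored: {len(canon)}, bytes saved: {saved}")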
40 CFR 1042.110 - Recording reductant use and other diagnostic functions.
Code of Federal Regulations, 2010 CFR
2010-07-01
...) The onboard computer log must record in nonvolatile computer memory all incidents of engine operation... such operation in nonvolatile computer memory. You are not required to monitor NOX concentrations...
FFTs in external or hierarchical memory
NASA Technical Reports Server (NTRS)
Bailey, David H.
1989-01-01
A description is given of advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) use strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the Cray-2, the Cray X-MP, and the Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main memory routine runs at nearly 2 Gflops.
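A common way to realize such a two-pass, unit-stride FFT is the four-step factorization of a length N = N1*N2 transform; the in-memory NumPy sketch below illustrates the factorization only (it is not the Cray implementation, and the out-of-core data movement is merely described in comments):

    import numpy as np

    def four_step_fft(x, n1, n2):
        """Four-step FFT of length n1*n2; each step touches the data once.

        In an out-of-core setting the n1 x n2 matrix lives in external
        storage and each step streams it through main memory in long,
        unit-stride vectors; here everything is in memory for clarity.
        """
        n = n1 * n2
        m = x.reshape(n2, n1).T                        # view x[j1 + n1*j2] as M[j1, j2]
        a = np.fft.fft(m, axis=1)                      # 1) n1 FFTs of length n2
        tw = np.exp(-2j * np.pi *
                    np.outer(np.arange(n1), np.arange(n2)) / n)
        a = a * tw                                     # 2) twiddle multiplication
        b = np.fft.fft(a, axis=0)                      # 3) n2 FFTs of length n1
        return b.reshape(n)                            # 4) X[k2 + n2*k1] = B[k1, k2]

    x = np.random.default_rng(0).standard_normal(1024) + 0j
    assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))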
Leahy, P.P.
1982-01-01
The Trescott computer program for modeling groundwater flow in three dimensions has been modified to (1) treat aquifer and confining bed pinchouts more realistically and (2) reduce the computer memory requirements needed for the input data. Using the original program, simulation of aquifer systems with nonrectangular external boundaries may result in a large number of nodes that are not involved in the numerical solution of the problem, but require computer storage. (USGS)
Provably unbounded memory advantage in stochastic simulation using quantum mechanics
NASA Astrophysics Data System (ADS)
Garner, Andrew J. P.; Liu, Qing; Thompson, Jayne; Vedral, Vlatko; Gu, Mile
2017-10-01
Simulating the stochastic evolution of real quantities on a digital computer requires a trade-off between the precision to which these quantities are approximated, and the memory required to store them. The statistical accuracy of the simulation is thus generally limited by the internal memory available to the simulator. Here, using tools from computational mechanics, we show that quantum processors with a fixed finite memory can simulate stochastic processes of real variables to arbitrarily high precision. This demonstrates a provable, unbounded memory advantage that a quantum simulator can exhibit over its best possible classical counterpart.
High efficiency coherent optical memory with warm rubidium vapour
Hosseini, M.; Sparkes, B.M.; Campbell, G.; Lam, P.K.; Buchler, B.C.
2011-01-01
By harnessing aspects of quantum mechanics, communication and information processing could be radically transformed. Promising forms of quantum information technology include optical quantum cryptographic systems and computing using photons for quantum logic operations. As with current information processing systems, some form of memory will be required. Quantum repeaters, which are required for long distance quantum key distribution, require quantum optical memory as do deterministic logic gates for optical quantum computing. Here, we present results from a coherent optical memory based on warm rubidium vapour and show 87% efficient recall of light pulses, the highest efficiency measured to date for any coherent optical memory suitable for quantum information applications. We also show storage and recall of up to 20 pulses from our system. These results show that simple warm atomic vapour systems have clear potential as a platform for quantum memory. PMID:21285952
Development of 3-Year Roadmap to Transform the Discipline of Systems Engineering
2010-03-31
quickly humans could physically construct them. Indeed, magnetic core memory was entirely constructed by human hands until it was superseded by...For their mainframe computers, IBM develops the applications, operating system, computer hardware and microprocessors (off the shelf standard memory ...processor developers work on potential computational and memory pipelines to support the required performance capabilities and use the available transistors
Rapid solution of large-scale systems of equations
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.
1994-01-01
The analysis and design of complex aerospace structures requires the rapid solution of large systems of linear and nonlinear equations, eigenvalue extraction for buckling, vibration and flutter modes, structural optimization and design sensitivity calculation. Computers with multiple processors and vector capabilities can offer substantial computational advantages over traditional scalar computers for these analyses. These computers fall into two categories: shared memory computers and distributed memory computers. This presentation covers general-purpose, highly efficient algorithms for generation/assembly of element matrices, solution of systems of linear and nonlinear equations, eigenvalue and design sensitivity analysis and optimization. All algorithms are coded in FORTRAN for shared memory computers and many are adapted to distributed memory computers. The capability and numerical performance of these algorithms will be addressed.
ERIC Educational Resources Information Center
Zhang, Yili; Smolen, Paul; Baxter, Douglas A.; Byrne, John H.
2010-01-01
Memory consolidation and reconsolidation require kinase activation and protein synthesis. Blocking either process during or shortly after training or recall disrupts memory stabilization, which suggests the existence of a critical time window during which these processes are necessary. Using a computational model of kinase synthesis and…
NASA Technical Reports Server (NTRS)
Ratner, R. S.; Shapiro, E. B.; Zeidler, H. M.; Wahlstrom, S. E.; Clark, C. B.; Goldberg, J.
1973-01-01
This final report summarizes the work on the design of a fault tolerant digital computer for aircraft. Volume 2 is composed of two parts. Part 1 is concerned with the computational requirements associated with an advanced commercial aircraft. Part 2 reviews the technology that will be available for the implementation of the computer in the 1975-1985 period. With regard to the computation task, 26 computations have been categorized according to computational load, memory requirements, criticality, permitted down-time, and the need to save data in order to effect a roll-back. The technology part stresses the impact of large scale integration (LSI) on the realization of logic and memory. Also considered were module interconnection possibilities so as to minimize fault propagation.
Memory-Efficient Analysis of Dense Functional Connectomes
Loewe, Kristian; Donohue, Sarah E.; Schoenfeld, Mircea A.; Kruse, Rudolf; Borgelt, Christian
2016-01-01
The functioning of the human brain relies on the interplay and integration of numerous individual units within a complex network. To identify network configurations characteristic of specific cognitive tasks or mental illnesses, functional connectomes can be constructed based on the assessment of synchronous fMRI activity at separate brain sites, and then analyzed using graph-theoretical concepts. In most previous studies, relatively coarse parcellations of the brain were used to define regions as graphical nodes. Such parcellated connectomes are highly dependent on parcellation quality because regional and functional boundaries need to be relatively consistent for the results to be interpretable. In contrast, dense connectomes are not subject to this limitation, since the parcellation inherent to the data is used to define graphical nodes, also allowing for a more detailed spatial mapping of connectivity patterns. However, dense connectomes are associated with considerable computational demands in terms of both time and memory requirements. The memory required to explicitly store dense connectomes in main memory can render their analysis infeasible, especially when considering high-resolution data or analyses across multiple subjects or conditions. Here, we present an object-based matrix representation that achieves a very low memory footprint by computing matrix elements on demand instead of explicitly storing them. In doing so, memory required for a dense connectome is reduced to the amount needed to store the underlying time series data. Based on theoretical considerations and benchmarks, different matrix object implementations and additional programs (based on available Matlab functions and Matlab-based third-party software) are compared with regard to their computational efficiency. The matrix implementation based on on-demand computations has very low memory requirements, thus enabling analyses that would be otherwise infeasible to conduct due to insufficient memory. An open source software package containing the created programs is available for download. PMID:27965565
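A minimal Python analogue of the on-demand idea described above (the actual package is Matlab-based; the class and method names here are hypothetical) keeps only the z-scored time series and recomputes correlations when they are requested:

    import numpy as np

    class OnDemandCorrMatrix:
        """Dense connectome whose entries are computed on demand.

        Only the (z-scored) time series are kept in memory; a correlation
        r_ij is recomputed whenever it is requested, so memory stays at
        O(n_voxels * n_timepoints) instead of O(n_voxels^2).
        """
        def __init__(self, timeseries):               # shape: (voxels, timepoints)
            ts = np.asarray(timeseries, dtype=np.float64)
            ts = ts - ts.mean(axis=1, keepdims=True)
            self.z = ts / np.linalg.norm(ts, axis=1, keepdims=True)

        def __getitem__(self, idx):
            i, j = idx
            return float(self.z[i] @ self.z[j])       # Pearson correlation r_ij

        def row(self, i):                             # one row at a time, e.g. for degree
            return self.z @ self.z[i]

    rng = np.random.default_rng(0)
    data = rng.standard_normal((5000, 200))           # 5000 voxels, 200 volumes
    c = OnDemandCorrMatrix(data)
    degree_0 = (c.row(0) > 0.3).sum() - 1             # graph degree of voxel 0 at r > 0.3
    print(c[0, 1], degree_0)

For 5,000 voxels and 200 volumes this stores about 8 MB of time series instead of a 200 MB correlation matrix, at the cost of recomputing dot products on access.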
Continuous-variable quantum computing in optical time-frequency modes using quantum memories.
Humphreys, Peter C; Kolthammer, W Steven; Nunn, Joshua; Barbieri, Marco; Datta, Animesh; Walmsley, Ian A
2014-09-26
We develop a scheme for time-frequency encoded continuous-variable cluster-state quantum computing using quantum memories. In particular, we propose a method to produce, manipulate, and measure two-dimensional cluster states in a single spatial mode by exploiting the intrinsic time-frequency selectivity of Raman quantum memories. Time-frequency encoding enables the scheme to be extremely compact, requiring a number of memories that are a linear function of only the number of different frequencies in which the computational state is encoded, independent of its temporal duration. We therefore show that quantum memories can be a powerful component for scalable photonic quantum information processing architectures.
Experimentally modeling stochastic processes with less memory by the use of a quantum processor
Palsson, Matthew S.; Gu, Mile; Ho, Joseph; Wiseman, Howard M.; Pryde, Geoff J.
2017-01-01
Computer simulation of observable phenomena is an indispensable tool for engineering new technology, understanding the natural world, and studying human society. However, the most interesting systems are often so complex that simulating their future behavior demands storing immense amounts of information regarding how they have behaved in the past. For increasingly complex systems, simulation becomes increasingly difficult and is ultimately constrained by resources such as computer memory. Recent theoretical work shows that quantum theory can reduce this memory requirement beyond ultimate classical limits, as measured by a process’ statistical complexity, C. We experimentally demonstrate this quantum advantage in simulating stochastic processes. Our quantum implementation observes a memory requirement of Cq = 0.05 ± 0.01, far below the ultimate classical limit of C = 1. Scaling up this technique would substantially reduce the memory required in simulations of more complex systems. PMID:28168218
Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures.
Kleftogiannis, Dimitrios; Kalnis, Panos; Bajic, Vladimir B
2013-01-01
A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, it is expected that the memory footprint required for the assembly process will increase dramatically and will emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements for different assembly programs even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and result in reduction of the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and we comment on some findings regarding suitable computational resources for assembly.
A generalized memory test algorithm
NASA Technical Reports Server (NTRS)
Milner, E. J.
1982-01-01
A general algorithm for testing digital computer memory is presented. The test checks that (1) every bit can be cleared and set in each memory word, and (2) bits are not erroneously cleared and/or set elsewhere in memory at the same time. The algorithm can be applied to any size memory block and any size memory word. It is concise and efficient, requiring very few cycles through memory. For example, a test of 16-bit-word-size memory requires only 384 cycles through memory. Approximately 15 seconds were required to test a 32K block of such memory, using a microcomputer having a cycle time of 133 nanoseconds.
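The sketch below shows a much-simplified walking-ones test with the same two properties, simulated over a Python list standing in for a memory block; the paper's algorithm is considerably more economical in memory cycles, so this is illustrative only:

    def memory_test(mem, word_bits=16):
        """Check that every bit of every word can be set and cleared, and
        that doing so does not disturb any other bit in the block.

        `mem` is a list of integers standing in for memory words; the test
        is destructive and leaves the block cleared.  Returns True on pass.
        """
        for addr in range(len(mem)):
            mem[addr] = 0
        for bit in range(word_bits):                 # walk a single 1 through each word
            pattern = 1 << bit
            for addr in range(len(mem)):
                mem[addr] = pattern
            for addr in range(len(mem)):             # nothing else may have flipped
                if mem[addr] != pattern:
                    return False
                mem[addr] = 0
                if mem[addr] != 0:                   # bit must clear again
                    return False
        return all(word == 0 for word in mem)

    print(memory_test([0] * 32 * 1024))              # a 32K block of 16-bit words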
A comparison of several methods of solving nonlinear regression groundwater flow problems
Cooley, Richard L.
1985-01-01
Computational efficiency and computer memory requirements for four methods of minimizing functions were compared for four test nonlinear-regression steady state groundwater flow problems. The fastest methods were the Marquardt and quasi-linearization methods, which required almost identical computer times and numbers of iterations; the next fastest was the quasi-Newton method, and last was the Fletcher-Reeves method, which did not converge in 100 iterations for two of the problems. The fastest method per iteration was the Fletcher-Reeves method, and this was followed closely by the quasi-Newton method. The Marquardt and quasi-linearization methods were slower. For all four methods the speed per iteration was directly related to the number of parameters in the model. However, this effect was much more pronounced for the Marquardt and quasi-linearization methods than for the other two. Hence the quasi-Newton (and perhaps Fletcher-Reeves) method might be more efficient than either the Marquardt or quasi-linearization methods if the number of parameters in a particular model were large, although this remains to be proven. The Marquardt method required somewhat less central memory than the quasi-linearization method for three of the four problems. For all four problems the quasi-Newton method required roughly two thirds to three quarters of the memory required by the Marquardt method, and the Fletcher-Reeves method required slightly less memory than the quasi-Newton method. Memory requirements were not excessive for any of the four methods.
A view of Kanerva's sparse distributed memory
NASA Technical Reports Server (NTRS)
Denning, P. J.
1986-01-01
Pentti Kanerva is working on a new class of computers, which are called pattern computers. Pattern computers may close the gap between capabilities of biological organisms to recognize and act on patterns (visual, auditory, tactile, or olfactory) and capabilities of modern computers. Combinations of numeric, symbolic, and pattern computers may one day be capable of sustaining robots. The overview of the requirements for a pattern computer, a summary of Kanerva's Sparse Distributed Memory (SDM), and examples of tasks this computer can be expected to perform well are given.
FPGA-Based, Self-Checking, Fault-Tolerant Computers
NASA Technical Reports Server (NTRS)
Some, Raphael; Rennels, David
2004-01-01
A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
Validation Test Report for the Automated Optical Processing System (AOPS) Version 4.8
2013-06-28
be familiar with UNIX; BASH shell programming; and remote sensing, particularly regarding computer processing of satellite data. The system memory ...and storage requirements are difficult to gauge. The amount of memory needed is dependent upon the amount and type of satellite data you wish to...process; the larger the area, the larger the memory requirement. For example, the entire Atlantic Ocean will require more processing power than the
Importance of balanced architectures in the design of high-performance imaging systems
NASA Astrophysics Data System (ADS)
Sgro, Joseph A.; Stanton, Paul C.
1999-03-01
Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high-performance computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high-performance general-purpose microprocessor technology have produced an abundance of low-cost components suitable for use in high-performance computing systems. A common pitfall in the design of high-performance imaging systems, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The characteristic symptom of this problem is the failure of system performance to scale as more processors are added. The problem becomes exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high-performance external memory interfaces makes it practical to design high-performance imaging systems with balanced computational and memory bandwidth. Real-world examples of such designs will be presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.
Cryogenic Memories based on Spin-Singlet and Spin-Triplet Ferromagnetic Josephson Junctions
NASA Astrophysics Data System (ADS)
Gingrich, Eric
The last several decades have seen an explosion in the use and size of computers for scientific applications. The US Department of Energy has set an ExaScale computing goal for high performance computing that is projected to be unattainable by current CMOS computing designs. This has led to a renewed interest in superconducting computing as a means of beating these projections. One of the primary requirements of this thrust is the development of an efficient cryogenic memory. Estimates of power consumption of early Rapid Single Flux Quantum (RSFQ) memory designs are on the order of MW, far too steep for any real application. Therefore, other memory concepts are required. S/F/S Josephson Junctions, a class of device in which two superconductors (S) are separated by one or more ferromagnetic layers (F) has shown promise as a memory element. Several different systems have been proposed utilizing either the spin-singlet or spin-triplet superconducting states. This talk will discuss the concepts underpinning these devices, and the recent work done to demonstrate their feasibility. This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via U.S. Army Research Office Contract W911NF-14-C-0115.
Donato, David I.
2013-01-01
A specialized technique is used to compute weighted ordinary least-squares (OLS) estimates of the parameters of the National Descriptive Model of Mercury in Fish (NDMMF) in less time using less computer memory than general methods. The characteristics of the NDMMF allow the two products X'X and X'y in the normal equations to be filled out in a second or two of computer time during a single pass through the N data observations. As a result, the matrix X does not have to be stored in computer memory and the computationally expensive matrix multiplications generally required to produce X'X and X'y do not have to be carried out. The normal equations may then be solved to determine the best-fit parameters in the OLS sense. The computational solution based on this specialized technique requires O(8p^2 + 16p) bytes of computer memory for p parameters on a machine with 8-byte double-precision numbers. This publication includes a reference implementation of this technique and a Gaussian-elimination solver in preliminary custom software.
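The single-pass accumulation described above can be sketched as follows (a generic Python illustration, not the publication's reference implementation; np.linalg.solve stands in for its Gaussian-elimination solver):

    import numpy as np

    def streaming_weighted_ols(rows, p):
        """Accumulate the normal equations in one pass over the observations.

        `rows` yields (x, y, w) triples, where x is the length-p regressor
        vector for one observation, y its response, and w its weight.  Only
        the p x p matrix X'WX and the length-p vector X'Wy are stored, so
        memory is O(p^2) regardless of the number of observations N.
        """
        xtwx = np.zeros((p, p))
        xtwy = np.zeros(p)
        for x, y, w in rows:
            x = np.asarray(x, dtype=float)
            xtwx += w * np.outer(x, x)
            xtwy += w * y * x
        return np.linalg.solve(xtwx, xtwy)       # best-fit parameters (weighted OLS)

    # Tiny check against the explicit matrix formulation.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 4)); beta = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ beta + 0.01 * rng.standard_normal(1000); w = np.ones(1000)
    est = streaming_weighted_ols(zip(X, y, w), p=4)
    assert np.allclose(est, np.linalg.lstsq(X, y, rcond=None)[0], atol=1e-3)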
GenomicTools: a computational platform for developing high-throughput analytics in genomics.
Tsirigos, Aristotelis; Haiminen, Niina; Bilal, Erhan; Utro, Filippo
2012-01-15
Recent advances in sequencing technology have resulted in the dramatic increase of sequencing data, which, in turn, requires efficient management of computational resources, such as computing time, memory requirements as well as prototyping of computational pipelines. We present GenomicTools, a flexible computational platform, comprising both a command-line set of tools and a C++ API, for the analysis and manipulation of high-throughput sequencing data such as DNA-seq, RNA-seq, ChIP-seq and MethylC-seq. GenomicTools implements a variety of mathematical operations between sets of genomic regions thereby enabling the prototyping of computational pipelines that can address a wide spectrum of tasks ranging from pre-processing and quality control to meta-analyses. Additionally, the GenomicTools platform is designed to analyze large datasets of any size by minimizing memory requirements. In practical applications, where comparable, GenomicTools outperforms existing tools in terms of both time and memory usage. The GenomicTools platform (version 2.0.0) was implemented in C++. The source code, documentation, user manual, example datasets and scripts are available online at http://code.google.com/p/ibm-cbc-genomic-tools.
NASA Astrophysics Data System (ADS)
Hunter, Geoffrey
2004-01-01
A computational process is classified according to the theoretical model that is capable of executing it; computational processes that require a non-predeterminable amount of intermediate storage for their execution are Turing-machine (TM) processes, while those whose storage is predeterminable are Finite Automaton (FA) processes. Simple processes (such as a traffic light controller) are executable by a Finite Automaton, whereas the most general kind of computation requires a Turing Machine for its execution. This implies that a TM process must have a non-predeterminable amount of memory allocated to it at intermediate instants of its execution; i.e. dynamic memory allocation. Many processes encountered in practice are TM processes. The implication for computational practice is that the hardware (CPU) architecture and its operating system must facilitate dynamic memory allocation, and that the programming language used to specify TM processes must have statements with the semantic attribute of dynamic memory allocation, for in Alan Turing's thesis on computation (1936) the "standard description" of a process is invariant over the most general data that the process is designed to process; i.e. the program describing the process should never have to be modified to allow for differences in the data that is to be processed in different instantiations; i.e. data-invariant programming. Any non-trivial program is partitioned into sub-programs (procedures, subroutines, functions, modules, etc). Examination of the calls/returns between the subprograms reveals that they are nodes in a tree-structure; this tree-structure is independent of the programming language used to encode (define) the process. Each sub-program typically needs some memory for its own use (to store values intermediate between its received data and its computed results); this locally required memory is not needed before the subprogram commences execution, and it is not needed after its execution terminates; it may be allocated as its execution commences, and deallocated as its execution terminates, and if the amount of this local memory is not known until just before execution commencement, then it is essential that it be allocated dynamically as the first action of its execution. This dynamically allocated/deallocated storage of each subprogram's intermediate values conforms with the stack discipline; i.e. last allocated = first to be deallocated, an incidental benefit of which is automatic overlaying of variables. This stack-based dynamic memory allocation was a semantic implication of the nested block structure that originated in the ALGOL-60 programming language. ALGOL-60 was a TM language, because the amount of memory allocated on subprogram (block/procedure) entry (for arrays, etc) was computable at execution time. A more general requirement of a Turing machine process is for code generation at run-time; this mandates access to the source language processor (compiler/interpreter) during execution of the process. This fundamental aspect of computer science is important to the future of system design, because it has been overlooked throughout the 55 years since modern computing began in 1948. The popular computer systems of this first half-century of computing were constrained by compile-time (or even operating system boot-time) memory allocation, and were thus limited to executing FA processes.
The practical effect was that the distinction between the data-invariant program and its variable data was blurred; programmers had to make trial and error executions, modifying the program's compile-time constants (array dimensions) to iterate towards the values required at run-time by the data being processed. This era of trial and error computing still persists; it pervades the culture of current (2003) computing practice.
Computational Issues in Damping Identification for Large Scale Problems
NASA Technical Reports Server (NTRS)
Pilkey, Deborah L.; Roe, Kevin P.; Inman, Daniel J.
1997-01-01
Two damping identification methods are tested for efficiency in large-scale applications. One is an iterative routine, and the other a least squares method. Numerical simulations have been performed on multiple degree-of-freedom models to test the effectiveness of the algorithm and the usefulness of parallel computation for the problems. High Performance Fortran is used to parallelize the algorithm. Tests were performed using the IBM-SP2 at NASA Ames Research Center. The least squares method tested incurs high communication costs, which reduces the benefit of high performance computing. This method's memory requirement grows at a very rapid rate, meaning that larger problems can quickly exceed available computer memory. The iterative method's memory requirement grows at a much slower pace and is able to handle problems with 500+ degrees of freedom on a single processor. This method benefits from parallelization, and significant speedup can be seen for problems of 100+ degrees-of-freedom.
NASA Technical Reports Server (NTRS)
Tuccillo, J. J.
1984-01-01
Numerical Weather Prediction (NWP), for both operational and research purposes, requires not only fast computational speed but also large memory. A technique for solving the Primitive Equations for atmospheric motion on the CYBER 205, as implemented in the Mesoscale Atmospheric Simulation System, which is fully vectorized and requires substantially less memory than other techniques such as the Leapfrog or Adams-Bashforth schemes, is discussed. The technique presented uses the Euler-backward time marching scheme. Also discussed are several techniques for reducing the computational time of the model by replacing slow intrinsic routines with faster algorithms which use only hardware vector instructions.
Discriminative Hierarchical K-Means Tree for Large-Scale Image Classification.
Chen, Shizhi; Yang, Xiaodong; Tian, Yingli
2015-09-01
A key challenge in large-scale image classification is how to achieve efficiency in terms of both computation and memory without compromising classification accuracy. The learning-based classifiers achieve the state-of-the-art accuracies, but have been criticized for the computational complexity that grows linearly with the number of classes. The nonparametric nearest neighbor (NN)-based classifiers naturally handle large numbers of categories, but incur prohibitively expensive computation and memory costs. In this brief, we present a novel classification scheme, i.e., discriminative hierarchical K-means tree (D-HKTree), which combines the advantages of both learning-based and NN-based classifiers. The complexity of the D-HKTree only grows sublinearly with the number of categories, which is much better than the recent hierarchical support vector machines-based methods. The memory requirement is an order of magnitude less than that of the recent Naïve Bayesian NN-based approaches. The proposed D-HKTree classification scheme is evaluated on several challenging benchmark databases and achieves the state-of-the-art accuracies with significantly lower computation cost and memory requirements.
Multi-processor including data flow accelerator module
Davidson, George S.; Pierce, Paul E.
1990-01-01
An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
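A software toy of the firing rule described above (hypothetical field names; the invention describes hardware tag logic, not Python) keeps per-cell operand slots and tag bits, queues a cell when all operands have arrived, and distributes its result along the linking slots:

    from collections import deque

    class Cell:
        """One memory cell of the toy dataflow module: operand slots, a
        primitive identifier, and links naming which cells receive the result."""
        def __init__(self, primitive, n_operands, links):
            self.primitive = primitive               # index into the primitive table
            self.slots = [None] * n_operands         # operand values
            self.tags = [False] * n_operands         # tag bit: operand present?
            self.links = links                       # (cell_id, slot) destinations

    PRIMITIVES = {'add': lambda a, b: a + b, 'mul': lambda a, b: a * b}

    def run(cells, initial_tokens):
        ready = deque()
        def deliver(cell_id, slot, value):           # write operand, check tags
            c = cells[cell_id]
            c.slots[slot], c.tags[slot] = value, True
            if all(c.tags):                          # all operands present -> schedule
                ready.append(cell_id)
        for cell_id, slot, value in initial_tokens:
            deliver(cell_id, slot, value)
        results = {}
        while ready:
            cid = ready.popleft()
            c = cells[cid]
            out = PRIMITIVES[c.primitive](*c.slots)  # execute the primitive
            results[cid] = out
            for dest, slot in c.links:               # distribute result to consumers
                deliver(dest, slot, out)
        return results

    # (a+b) * (c+d): two adds feed one multiply.
    cells = {0: Cell('add', 2, [(2, 0)]), 1: Cell('add', 2, [(2, 1)]), 2: Cell('mul', 2, [])}
    print(run(cells, [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]))   # {0: 3, 1: 7, 2: 21}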
Persistent Memory in Single Node Delay-Coupled Reservoir Computing.
Kovac, André David; Koall, Maximilian; Pipa, Gordon; Toutounji, Hazem
2016-01-01
Delays are ubiquitous in biological systems, ranging from genetic regulatory networks and synaptic conductances, to predator/prey population interactions. The evidence is mounting, not only to the presence of delays as physical constraints in signal propagation speed, but also to their functional role in providing dynamical diversity to the systems that comprise them. The latter observation in biological systems inspired the recent development of a computational architecture that harnesses this dynamical diversity, by delay-coupling a single nonlinear element to itself. This architecture is a particular realization of Reservoir Computing, where stimuli are injected into the system in time rather than in space as is the case with classical recurrent neural network realizations. This architecture also exhibits an internal memory which fades in time, an important prerequisite to the functioning of any reservoir computing device. However, fading memory is also a limitation to any computation that requires persistent storage. In order to overcome this limitation, the current work introduces an extended version to the single node Delay-Coupled Reservoir, that is based on trained linear feedback. We show by numerical simulations that adding task-specific linear feedback to the single node Delay-Coupled Reservoir extends the class of solvable tasks to those that require nonfading memory. We demonstrate, through several case studies, the ability of the extended system to carry out complex nonlinear computations that depend on past information, whereas the computational power of the system with fading memory alone quickly deteriorates. Our findings provide the theoretical basis for future physical realizations of a biologically-inspired ultrafast computing device with extended functionality. PMID:27783690
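A toy discrete emulation of the underlying single-node Delay-Coupled Reservoir (parameter values, the exact discretization, and the delayed-recall task are assumptions; the paper's contribution, trained linear output feedback for non-fading memory, is not included here) looks like this:

    import numpy as np

    rng = np.random.default_rng(0)
    N, T, washout = 60, 3000, 100          # virtual nodes, time steps, discarded steps
    u = rng.uniform(-0.5, 0.5, T)          # scalar input stream
    mask = rng.choice([-1.0, 1.0], N)      # input mask for time multiplexing

    x = np.zeros(N)
    states = np.zeros((T, N))
    for k in range(T):
        prev = x.copy()                    # state one delay period (tau) earlier
        for i in range(N):
            # virtual node i is driven by its own delayed state, by its left
            # neighbour within the current delay period (inertia of the single
            # nonlinear node), and by the masked input sample
            left = x[i - 1] if i > 0 else prev[-1]
            x[i] = np.tanh(0.45 * prev[i] + 0.35 * left + 0.6 * mask[i] * u[k])
        states[k] = x

    # fading-memory task: linear readout that recalls u(k - 5)
    d = 5
    X, y = states[washout + d:], u[washout:-d]
    w = np.linalg.solve(X.T @ X + 1e-6 * np.eye(N), X.T @ y)   # ridge regression
    nrmse = np.sqrt(np.mean((X @ w - y) ** 2)) / np.std(y)
    print(f"delay-{d} recall NRMSE: {nrmse:.3f}")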
NASA Technical Reports Server (NTRS)
Bradley, D. B.; Irwin, J. D.
1974-01-01
A computer simulation model for a multiprocessor computer is developed that is useful for studying the problem of matching a multiprocessor's memory space, memory bandwidth, and numbers and speeds of processors with aggregate job set characteristics. The model assumes an input work load of a set of recurrent jobs. The model includes a feedback scheduler/allocator which attempts to improve system performance through higher memory bandwidth utilization by matching individual job requirements for space and bandwidth with space availability and estimates of bandwidth availability at the times of memory allocation. The simulation model includes provisions for specifying precedence relations among the jobs in a job set, and provisions for specifying precedence execution of TMR (Triple Modular Redundant) and SIMPLEX (non-redundant) jobs.
Computational principles of working memory in sentence comprehension.
Lewis, Richard L; Vasishth, Shravan; Van Dyke, Julie A
2006-10-01
Understanding a sentence requires a working memory of the partial products of comprehension, so that linguistic relations between temporally distal parts of the sentence can be rapidly computed. We describe an emerging theoretical framework for this working memory system that incorporates several independently motivated principles of memory: a sharply limited attentional focus, rapid retrieval of item (but not order) information subject to interference from similar items, and activation decay (forgetting over time). A computational model embodying these principles provides an explanation of the functional capacities and severe limitations of human processing, as well as accounts of reading times. The broad implication is that the detailed nature of cross-linguistic sentence processing emerges from the interaction of general principles of human memory with the specialized task of language comprehension.
Parallel processing for scientific computations
NASA Technical Reports Server (NTRS)
Alkhatib, Hasan S.
1995-01-01
The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.
The impact of supercomputers on experimentation: A view from a national laboratory
NASA Technical Reports Server (NTRS)
Peterson, V. L.; Arnold, J. O.
1985-01-01
The relative roles of large scale scientific computers and physical experiments in several science and engineering disciplines are discussed. Increasing dependence on computers is shown to be motivated both by the rapid growth in computer speed and memory, which permits accurate numerical simulation of complex physical phenomena, and by the rapid reduction in the cost of performing a calculation, which makes computation an increasingly attractive complement to experimentation. Computer speed and memory requirements are presented for selected areas of such disciplines as fluid dynamics, aerodynamics, aerothermodynamics, chemistry, atmospheric sciences, astronomy, and astrophysics, together with some examples of the complementary nature of computation and experiment. Finally, the impact of the emerging role of computers in the technical disciplines is discussed in terms of both the requirements for experimentation and the attainment of previously inaccessible information on physical processes.
Optoelectronic Inner-Product Neural Associative Memory
NASA Technical Reports Server (NTRS)
Liu, Hua-Kuang
1993-01-01
Optoelectronic apparatus acts as artificial neural network performing associative recall of binary images. Recall process is iterative one involving optical computation of inner products between binary input vector and one or more reference binary vectors in memory. Inner-product method requires far less memory space than matrix-vector method.
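A numerical sketch of inner-product associative recall for bipolar patterns (a generic toy, not the optoelectronic apparatus) stores only the M reference vectors rather than an n x n outer-product weight matrix:

    import numpy as np

    def recall(memory, probe, iterations=5):
        """Inner-product associative recall for bipolar (+1/-1) patterns.

        Each step computes the inner product of the current estimate with
        every stored reference vector, weights the references by those
        overlaps, sums, and thresholds -- only the M reference vectors are
        stored (M x n values), not an n x n weight matrix.
        """
        x = probe.copy()
        for _ in range(iterations):
            overlaps = memory @ x                 # M inner products
            x = np.sign(memory.T @ overlaps)      # weighted sum of references, thresholded
            x[x == 0] = 1
        return x

    rng = np.random.default_rng(2)
    memory = rng.choice([-1, 1], size=(4, 64))    # four stored 64-bit reference patterns
    target = memory[0]
    noisy = target.copy()
    flip = rng.choice(64, size=8, replace=False)  # corrupt 8 of 64 bits
    noisy[flip] *= -1
    print("bits recovered:", int((recall(memory, noisy) == target).sum()), "/ 64")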
Task Assignment Heuristics for Distributed CFD Applications
NASA Technical Reports Server (NTRS)
Lopez-Benitez, N.; Djomehri, M. J.; Biswas, R.; Biegel, Bryan (Technical Monitor)
2001-01-01
CFD applications require high-performance computational platforms: 1. Complex physics and domain configuration demand strongly coupled solutions; 2. Applications are CPU and memory intensive; and 3. Huge resource requirements can only be satisfied by teraflop-scale machines or distributed computing.
High Performance Implementation of 3D Convolutional Neural Networks on a GPU.
Lan, Qiang; Wang, Zelong; Wen, Mei; Zhang, Chunyuan; Wang, Yijie
2017-01-01
Convolutional neural networks have proven to be highly successful in applications such as image classification, object tracking, and many other tasks based on 2D inputs. Recently, researchers have started to apply convolutional neural networks to video classification, which constitutes a 3D input and requires far larger amounts of memory and much more computation. FFT based methods can reduce the amount of computation, but this generally comes at the cost of an increased memory requirement. On the other hand, the Winograd Minimal Filtering Algorithm (WMFA) can reduce the number of operations required and thus can speed up the computation, without increasing the required memory. This strategy was shown to be successful for 2D neural networks. We implement the algorithm for 3D convolutional neural networks and apply it to a popular 3D convolutional neural network which is used to classify videos and compare it to cuDNN. For our highly optimized implementation of the algorithm, we observe a twofold speedup for most of the 3D convolution layers of our test network compared to the cuDNN version.
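The core 1D Winograd identity F(2,3), which the paper generalizes to 3D convolution layers, can be checked directly with the standard transform matrices (this sketch verifies the identity only; it says nothing about the GPU implementation):

    import numpy as np

    # Winograd minimal filtering F(2,3): two outputs of a 3-tap filter from a
    # 4-sample input tile using 4 multiplications instead of 6.
    BT = np.array([[1,  0, -1,  0],
                   [0,  1,  1,  0],
                   [0, -1,  1,  0],
                   [0,  1,  0, -1]], dtype=float)
    G  = np.array([[1.0,  0.0, 0.0],
                   [0.5,  0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0.0,  0.0, 1.0]])
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=float)

    def winograd_f23(d, g):
        """d: input tile of 4 samples, g: filter of 3 taps -> 2 outputs."""
        m = (G @ g) * (BT @ d)        # element-wise product in the transform domain
        return AT @ m

    d = np.array([1.0, 2.0, 3.0, 4.0])
    g = np.array([0.5, -1.0, 2.0])
    direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                       d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
    assert np.allclose(winograd_f23(d, g), direct)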
SUMC fault tolerant computer system
NASA Technical Reports Server (NTRS)
1980-01-01
The results of the trade studies are presented. These trades cover: establishing the basic configuration, establishing the CPU/memory configuration, establishing an approach to crosstrapping interfaces, defining the requirements of the redundancy management unit (RMU), establishing a spare plane switching strategy for the fault-tolerant memory (FTM), and identifying the most cost effective way of extending the memory addressing capability beyond the 64 K-bytes (K=1024) of SUMC-II B. The results of the design are compiled in Contract End Item (CEI) Specification for the NASA Standard Spacecraft Computer II (NSSC-II), IBM 7934507. The implementation of the FTM and the memory address expansion are also described.
Computational aerodynamics development and outlook /Dryden Lecture in Research for 1979/
NASA Technical Reports Server (NTRS)
Chapman, D. R.
1979-01-01
Some past developments and current examples of computational aerodynamics are briefly reviewed. An assessment is made of the requirements on future computer memory and speed imposed by advanced numerical simulations, giving emphasis to the Reynolds averaged Navier-Stokes equations and to turbulent eddy simulations. Experimental scales of turbulence structure are used to determine the mesh spacings required to adequately resolve turbulent energy and shear. Assessment also is made of the changing market environment for developing future large computers, and of the projections of micro-electronics memory and logic technology that affect future computer capability. From the two assessments, estimates are formed of the future time scale in which various advanced types of aerodynamic flow simulations could become feasible. Areas of research judged especially relevant to future developments are noted.
NASA Astrophysics Data System (ADS)
Saputro, Dewi Retno Sari; Widyaningsih, Purnami
2017-08-01
In general, parameter estimation of the GWOLR model uses the maximum likelihood method, but this constructs a system of nonlinear equations, making it difficult to find the solution. Therefore, an approximate solution is needed. There are two popular numerical methods: the Newton and Quasi-Newton (QN) methods. Newton's method requires substantial computation time since it involves the Jacobian matrix (derivative). The QN method overcomes this drawback of Newton's method by replacing the derivative computation with a directly computed approximation. The QN method uses a Hessian matrix approximation based on the Davidon-Fletcher-Powell (DFP) formula. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is categorized as a QN method which shares the DFP formula's attribute of having a positive definite Hessian matrix. The BFGS method requires large memory in executing the program, so another algorithm to decrease memory usage is needed, namely Low Memory BFGS (LBFGS). The purpose of this research is to evaluate the efficiency of the LBFGS method in the iterative and recursive computation of the Hessian matrix and its inverse for GWOLR parameter estimation. From the research findings, we found that the BFGS and LBFGS methods have arithmetic operation counts of O(n^2) and O(nm), respectively.
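For reference, the limited-memory update at the heart of LBFGS is the two-loop recursion sketched below (a generic illustration, not the GWOLR code; the check constructs A-conjugate steps on a quadratic so the recursion reproduces the exact Newton direction):

    import numpy as np

    def lbfgs_direction(grad, s_list, y_list):
        """Two-loop recursion: return H_k^{-1} @ grad using only the last m
        curvature pairs s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i, so the
        memory cost is O(n m) instead of the O(n^2) of full BFGS."""
        q = grad.copy()
        alphas, rhos = [], []
        for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
            rho = 1.0 / (y @ s)
            alpha = rho * (s @ q)
            q -= alpha * y
            rhos.append(rho); alphas.append(alpha)
        if s_list:                                             # initial scaling H_0 = gamma I
            gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        else:
            gamma = 1.0
        r = gamma * q
        for (s, y), rho, alpha in zip(zip(s_list, y_list), reversed(rhos), reversed(alphas)):
            beta = rho * (y @ r)
            r += (alpha - beta) * s
        return r                     # the descent direction is -r

    # sanity check: on a quadratic with A-conjugate steps (the coordinate
    # directions for a diagonal A) the recursion reproduces A^{-1} @ g exactly
    A = np.diag([1.0, 4.0, 9.0])
    s_list = [np.eye(3)[i] for i in range(3)]
    y_list = [A @ s for s in s_list]          # exact curvature pairs y = A s
    g = np.array([2.0, -1.0, 3.0])
    assert np.allclose(lbfgs_direction(g, s_list, y_list), np.linalg.solve(A, g))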
Data compression strategies for ptychographic diffraction imaging
NASA Astrophysics Data System (ADS)
Loetgering, Lars; Rose, Max; Treffer, David; Vartanyants, Ivan A.; Rosenhahn, Axel; Wilhein, Thomas
2017-12-01
Ptychography is a computational imaging method for solving inverse scattering problems. To date, the high amount of redundancy present in ptychographic data sets requires computer memory that is orders of magnitude larger than the retrieved information. Here, we propose and compare data compression strategies that significantly reduce the amount of data required for wavefield inversion. Information metrics are used to measure the amount of data redundancy present in ptychographic data. Experimental results demonstrate the technique to be memory efficient and stable in the presence of systematic errors such as partial coherence and noise.
40 CFR 1042.115 - Other requirements.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 32 2010-07-01 2010-07-01 false Other requirements. 1042.115 Section 1042.115 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) AIR POLLUTION CONTROLS...). (3) The onboard computer log must record in nonvolatile computer memory all incidents of engine...
Data Movement Dominates: Advanced Memory Technology to Address the Real Exascale Power Problem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bergman, Keren
Energy is the fundamental barrier to Exascale supercomputing and is dominated by the cost of moving data from one point to another, not computation. Similarly, performance is dominated by data movement, not computation. The solution to this problem requires three critical technologies: 3D integration, optical chip-to-chip communication, and a new communication model. The Sandia-led "Data Movement Dominates" project aimed to develop memory systems and new architectures based on these technologies that have the potential to lower the cost of local memory accesses by orders of magnitude and provide substantially more bandwidth. Only through these transformational advances can future systems reach the goals of Exascale computing within a manageable power budget. The Sandia-led team included co-PIs from Columbia University, Lawrence Berkeley Lab, and the University of Maryland. The Columbia effort of Data Movement Dominates focused on developing a physically accurate simulation environment and experimental verification for optically-connected memory (OCM) systems that can enable continued performance scaling through high-bandwidth capacity, energy-efficient bit-rate transparency, and time-of-flight latency. With OCM, memory device parallelism and total capacity can scale to match future high-performance computing requirements without sacrificing data-movement efficiency. When we consider systems with integrated photonics, links to memory can be seamlessly integrated with the interconnection network; in a sense, memory becomes a primary aspect of the interconnection network. At the core of the Columbia effort, toward expanding our understanding of OCM-enabled computing, we created an integrated modeling and simulation environment that uniquely integrates the physical behavior of the optical layer. The PhoenxSim suite of design and software tools developed under this effort has enabled the co-design and performance evaluation of photonics-enabled OCM architectures on Exascale computing systems.
Lossy Wavefield Compression for Full-Waveform Inversion
NASA Astrophysics Data System (ADS)
Boehm, C.; Fichtner, A.; de la Puente, J.; Hanzich, M.
2015-12-01
We present lossy compression techniques, tailored to the inexact computation of sensitivity kernels, that significantly reduce the memory requirements of adjoint-based minimization schemes. Adjoint methods are a powerful tool to solve tomography problems in full-waveform inversion (FWI). Yet they face the challenge of massive memory requirements caused by the opposite directions of forward and adjoint simulations and the necessity to access both wavefields simultaneously during the computation of the sensitivity kernel. Thus, storage, I/O operations, and memory bandwidth become key topics in FWI. In this talk, we present strategies for the temporal and spatial compression of the forward wavefield. This comprises re-interpolation with coarse time steps and an adaptive polynomial degree of the spectral element shape functions. In addition, we predict the projection errors on a hierarchy of grids and re-quantize the residuals with an adaptive floating-point accuracy to improve the approximation. Furthermore, we use the first arrivals of adjoint waves to identify "shadow zones" that do not contribute to the sensitivity kernel at all. Updating and storing the wavefield within these shadow zones is skipped, which reduces memory requirements and computational costs at the same time. Compared to check-pointing, our approach has only a negligible computational overhead, utilizing the fact that a sufficiently accurate sensitivity kernel does not require a fully resolved forward wavefield. Furthermore, we use adaptive compression thresholds during the FWI iterations to ensure convergence. Numerical experiments on the reservoir scale and for the Western Mediterranean prove the high potential of this approach with an effective compression factor of 500-1000. Furthermore, it is computationally cheap and easy to integrate in both finite-difference and finite-element wave propagation codes.
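To make two of the ingredients concrete, re-interpolation with coarse time steps and re-quantization of residuals at reduced floating-point accuracy, here is a minimal sketch of our own (not the authors' implementation), applied to a wavefield stored as a (time, space) array; the stride, the float16 residual format and all names are assumptions.

```python
import numpy as np

def compress_wavefield(u, stride=4):
    """Keep every `stride`-th time slice in float32 and store the residuals of
    the skipped slices (relative to linear re-interpolation) in float16."""
    coarse = u[::stride].astype(np.float32)
    t = np.arange(u.shape[0])
    interp = np.empty_like(u, dtype=np.float32)
    for i in range(u.shape[1]):                 # interpolate each spatial point in time
        interp[:, i] = np.interp(t, t[::stride], coarse[:, i])
    residual = (u - interp).astype(np.float16)  # lossy re-quantization of the residual
    return coarse, residual

def decompress_wavefield(coarse, residual, stride=4):
    t = np.arange(residual.shape[0])
    interp = np.empty_like(residual, dtype=np.float32)
    for i in range(residual.shape[1]):
        interp[:, i] = np.interp(t, t[::stride], coarse[:, i])
    return interp + residual.astype(np.float32)

# toy example: 200 time steps, 50 grid points of a synthetic "wavefield"
u = np.cumsum(np.random.randn(200, 50), axis=0).astype(np.float32)
c, r = compress_wavefield(u)
print("max reconstruction error:", np.max(np.abs(decompress_wavefield(c, r) - u)))
```

The real scheme is adaptive (grid hierarchy, adaptive polynomial degree and thresholds), but the memory arithmetic is the same: the coarse slices plus half-precision residuals occupy a small fraction of the original float32 wavefield.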
NASA Astrophysics Data System (ADS)
Kavehei, Omid; Linn, Eike; Nielen, Lutz; Tappertzhofen, Stefan; Skafidas, Efstratios; Valov, Ilia; Waser, Rainer
2013-05-01
We report on the implementation of an Associative Capacitive Network (ACN) based on the nondestructive capacitive readout of two Complementary Resistive Switches (2-CRSs). ACNs are capable of performing a fully parallel search for Hamming distances (i.e. similarity) between input and stored templates. Unlike conventional associative memories, where charge retention is a key function and frequent refresh cycles are therefore required, in ACNs information is retained in a nonvolatile resistive state and normal tasks are carried out through capacitive coupling between input and output nodes. Each device consists of two CRS cells and no selector element is needed; therefore, CMOS circuitry is required only in the periphery, for addressing and read-out. Highly parallel processing, nonvolatility, wide interconnectivity and low-energy consumption are significant advantages of ACNs over conventional and emerging associative memories. These characteristics make ACNs one of the promising candidates for applications in memory-intensive and cognitive computing, in switches and routers as binary and ternary Content Addressable Memories (CAMs), and in intelligent data processing.
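Functionally, an ACN returns the Hamming distance between an input word and every stored template in a single parallel step. A purely software analogue of that associative search (illustrative only; the paper's contribution is the capacitive CRS hardware, not this algorithm) might look like:

```python
import numpy as np

# stored templates: one row per word of the associative memory
templates = np.random.randint(0, 2, size=(8, 16))   # 8 templates, 16 bits each
query = np.random.randint(0, 2, size=16)

# Hamming distance of the query to every template, computed "in parallel"
distances = np.count_nonzero(templates != query, axis=1)
best_match = np.argmin(distances)                    # CAM-style lookup: most similar template
print(distances, "best match is template", best_match)
```

In the hardware, this whole comparison happens through capacitive coupling in one read cycle, with no refresh and no per-cell selector, which is where the energy and density advantages come from.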
James, Ella L; Bonsall, Michael B; Hoppitt, Laura; Tunbridge, Elizabeth M; Geddes, John R; Milton, Amy L; Holmes, Emily A
2015-08-01
Memory of a traumatic event becomes consolidated within hours. Intrusive memories can then flash back repeatedly into the mind's eye and cause distress. We investigated whether reconsolidation-the process during which memories become malleable when recalled-can be blocked using a cognitive task and whether such an approach can reduce these unbidden intrusions. We predicted that reconsolidation of a reactivated visual memory of experimental trauma could be disrupted by engaging in a visuospatial task that would compete for visual working memory resources. We showed that intrusive memories were virtually abolished by playing the computer game Tetris following a memory-reactivation task 24 hr after initial exposure to experimental trauma. Furthermore, both memory reactivation and playing Tetris were required to reduce subsequent intrusions (Experiment 2), consistent with reconsolidation-update mechanisms. A simple, noninvasive cognitive-task procedure administered after emotional memory has already consolidated (i.e., > 24 hours after exposure to experimental trauma) may prevent the recurrence of intrusive memories of those emotional events. © The Author(s) 2015.
James, Ella L.; Bonsall, Michael B.; Hoppitt, Laura; Tunbridge, Elizabeth M.; Geddes, John R.; Milton, Amy L.
2015-01-01
Memory of a traumatic event becomes consolidated within hours. Intrusive memories can then flash back repeatedly into the mind’s eye and cause distress. We investigated whether reconsolidation—the process during which memories become malleable when recalled—can be blocked using a cognitive task and whether such an approach can reduce these unbidden intrusions. We predicted that reconsolidation of a reactivated visual memory of experimental trauma could be disrupted by engaging in a visuospatial task that would compete for visual working memory resources. We showed that intrusive memories were virtually abolished by playing the computer game Tetris following a memory-reactivation task 24 hr after initial exposure to experimental trauma. Furthermore, both memory reactivation and playing Tetris were required to reduce subsequent intrusions (Experiment 2), consistent with reconsolidation-update mechanisms. A simple, noninvasive cognitive-task procedure administered after emotional memory has already consolidated (i.e., > 24 hours after exposure to experimental trauma) may prevent the recurrence of intrusive memories of those emotional events. PMID:26133572
A multiresolution approach to iterative reconstruction algorithms in X-ray computed tomography.
De Witte, Yoni; Vlassenbroeck, Jelle; Van Hoorebeke, Luc
2010-09-01
In computed tomography, the application of iterative reconstruction methods in practical situations is impeded by their high computational demands. Especially in high-resolution X-ray computed tomography, where reconstruction volumes contain a high number of volume elements (several gigavoxels), this computational burden has prevented their breakthrough in practice. Besides the large number of calculations, iterative algorithms require the entire volume to be kept in memory during reconstruction, which quickly becomes cumbersome for large data sets. To overcome this obstacle, we present a novel multiresolution reconstruction, which greatly reduces the required amount of memory without significantly affecting the reconstructed image quality. It is shown that, combined with an efficient implementation on a graphics processing unit, the multiresolution approach enables the application of iterative algorithms in the reconstruction of large volumes at an acceptable speed using only limited resources.
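A minimal sketch of the coarse-to-fine idea, under our own simplifying assumptions (a matrix-form projector, a basic Landweber iteration instead of the paper's GPU-optimized algorithm, nearest-neighbour upsampling): reconstruct on a coarse grid first, then upsample that result to warm-start a few iterations on the fine grid, so the full-resolution volume is only held and iterated on briefly.

```python
import numpy as np

def landweber(A, b, x0, n_iter=100, step=None):
    """Basic Landweber iteration x <- x + step * A^T (b - A x)."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step below 2 / sigma_max^2
    x = x0.copy()
    for _ in range(n_iter):
        x += step * A.T @ (b - A @ x)
    return x

def multires_reconstruct(A_coarse, A_fine, b, upsample, n_coarse, n_fine):
    """Solve on the coarse grid, upsample, then refine on the fine grid."""
    x_c = landweber(A_coarse, b, np.zeros(A_coarse.shape[1]), n_iter=n_coarse)
    return landweber(A_fine, b, upsample(x_c), n_iter=n_fine)

# toy usage: 64-sample fine unknown, 32-sample coarse unknown, random projector
rng = np.random.default_rng(0)
A_fine = rng.standard_normal((200, 64)) / 64
U = np.repeat(np.eye(32), 2, axis=0)            # nearest-neighbour upsampling matrix
A_coarse = A_fine @ U                           # consistent coarse-grid projector
x_true = rng.standard_normal(64)
b = A_fine @ x_true
x_rec = multires_reconstruct(A_coarse, A_fine, b,
                             upsample=lambda x: U @ x, n_coarse=200, n_fine=200)
```

The memory argument is that the fine-grid volume (and the fine projector) is only needed for the final refinement stage, while most iterations run on a much smaller problem.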
Hippocampal activation during the recall of remote spatial memories in radial maze tasks.
Schlesiger, Magdalene I; Cressey, John C; Boublil, Brittney; Koenig, Julie; Melvin, Neal R; Leutgeb, Jill K; Leutgeb, Stefan
2013-11-01
Temporally graded retrograde amnesia is observed in human patients with medial temporal lobe lesions as well as in animal models of medial temporal lobe lesions. A time-limited role for these structures in memory recall has also been suggested by the observation that the rodent hippocampus and entorhinal cortex are activated during the retrieval of recent but not of remote memories. One notable exception is the recall of remote memories for platform locations in the water maze, which requires an intact hippocampus and results in hippocampal activation irrespective of the age of the memory. These findings raise the question whether the hippocampus is always involved in the recall of spatial memories or, alternatively, whether it might be required for procedural computations in the water maze task, such as for calculating a path to a hidden platform. We performed spatial memory testing in radial maze tasks to distinguish between these possibilities. Radial maze tasks require a choice between spatial locations on a center platform and thus have a lesser requirement for navigation than the water maze. However, we used a behavioral design in the radial maze that retained other aspects of the standard water maze task, such as the use of multiple start locations and retention testing in a single trial. Using the immediate early gene c-fos as a marker for neuronal activation, we found that all hippocampal subregions were more activated during the recall of remote compared to recent spatial memories. In areas CA3 and CA1, activation during remote memory testing was higher than in rats that were merely reexposed to the testing environment after the same time interval. Conversely, Fos levels in the dentate gyrus were increased after retention testing to the extent that was also observed in the corresponding exposure control group. This pattern of hippocampal activation was also obtained in a second version of the task that only used a single start arm instead of multiple start arms. The CA3 and CA1 activation during remote memory recall is consistent with the interpretation that an older memory might require increased pattern completion and/or relearning after longer time intervals. Irrespective of whether the hippocampus is required for remote memory recall, the hippocampus might engage in computations that either support recall of remote memories or that update remote memories. Copyright © 2013 Elsevier Inc. All rights reserved.
Development and Evaluation of a Casualty Evacuation Model for a European Conflict.
1985-12-01
EVAC, the computer code which implements our technique, has been used to solve a series of test problems in less time and requiring less memory than...the order of 1/K the amount of main memory for a K-commodity problem, so it can solve significantly larger problems than MCNF. ...technique may require only half the memory of the general L.P. package [6]. These advances are due to the efficient data structures which have been
HTMT-class Latency Tolerant Parallel Architecture for Petaflops Scale Computation
NASA Technical Reports Server (NTRS)
Sterling, Thomas; Bergman, Larry
2000-01-01
Computational Aero Sciences and other numerically intensive computation disciplines demand computing throughputs substantially greater than the Teraflops-scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that in combination with sufficient resolution and advanced adaptive techniques may force performance requirements towards Petaflops. This will be especially true for compute-intensive models such as Navier-Stokes, or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithmic techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA-led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops-scale computing in the 2004/5 timeframe. The Hybrid-Technology, MultiThreaded parallel computer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) and one percent of the power required by conventional semiconductor logic. Wave Division Multiplexing optical communications can approach a peak per-fiber bandwidth of 1 Tbps and the new Data Vortex network topology employing this technology can connect tens of thousands of ports, providing a bisection bandwidth on the order of a Petabyte per second with latencies well below 100 nanoseconds, even under heavy loads. Processor-in-Memory (PIM) technology combines logic and memory on the same chip, exposing the internal bandwidth of the memory row buffers at low latency. And holographic photorefractive storage technologies provide high-density memory with access a thousand times faster than conventional disk technologies. Together these technologies enable a new class of shared memory system architecture with a peak performance in the range of a Petaflops but size and power requirements comparable to today's largest Teraflops-scale systems. To achieve high sustained performance, HTMT combines an advanced multithreading processor architecture with a memory-driven coarse-grained latency management strategy called "percolation", yielding high efficiency while reducing much of the parallel programming burden. This paper will present the basic system architecture characteristics made possible through this series of advanced technologies and then give a detailed description of the new percolation approach to runtime latency management.
Stream-based Hebbian eigenfilter for real-time neuronal spike discrimination
2012-01-01
Background: Principal component analysis (PCA) has been widely employed for automatic neuronal spike sorting. Calculating principal components (PCs) is computationally expensive and requires complex numerical operations and large memory resources, so substantial hardware resources are needed for hardware implementations of PCA. The General Hebbian algorithm (GHA) has been proposed for calculating PCs of neuronal spikes in our previous work, which eliminates the need for the computationally expensive covariance analysis and eigenvalue decomposition of conventional PCA algorithms. However, large memory resources are still inherently required for storing a large volume of aligned spikes for training PCs. Such a large memory consumes substantial hardware resources and contributes significant power dissipation, which makes GHA difficult to implement in portable or implantable multi-channel recording micro-systems. Method: In this paper, we present a new algorithm for PCA-based spike sorting based on GHA, namely the stream-based Hebbian eigenfilter, which eliminates the inherent memory requirements of GHA while keeping the accuracy of spike sorting by utilizing the pseudo-stationarity of neuronal spikes. Because of the reduction of large hardware storage requirements, the proposed algorithm can lead to ultra-low hardware resources and power consumption in hardware implementations, which is critical for future multi-channel micro-systems. Both clinical and synthetic neural recording data sets were employed for evaluating the accuracy of the stream-based Hebbian eigenfilter. The performance of spike sorting using the stream-based eigenfilter and the computational complexity of the eigenfilter were rigorously evaluated and compared with conventional PCA algorithms. Field-programmable gate arrays (FPGAs) were employed to implement the proposed algorithm, evaluate the hardware implementations and demonstrate the reduction in both power consumption and hardware memory achieved by the streaming computation. Results and discussion: Results demonstrate that the stream-based eigenfilter can achieve the same accuracy and is 10 times more computationally efficient when compared with conventional PCA algorithms. Hardware evaluations show that 90.3% of logic resources, 95.1% of power consumption and 86.8% of computing latency can be reduced by the stream-based eigenfilter when compared with PCA hardware. By utilizing the streaming method, 92% of memory resources and 67% of power consumption can be saved when compared with the direct implementation of GHA. Conclusion: The stream-based Hebbian eigenfilter presents a novel approach to enable real-time spike sorting with reduced computational complexity and hardware costs. This new design can be further utilized for multi-channel neuro-physiological experiments or chronic implants. PMID:22490725
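The streaming character of the eigenfilter rests on the GHA update itself, which refines the leading principal components one spike at a time and never stores a covariance matrix or a spike buffer. A small software illustration of that update (Sanger's rule) on synthetic data, not the FPGA design, with all names ours:

```python
import numpy as np

def gha_update(W, x, lr=0.01):
    """One Generalized Hebbian Algorithm step.
    W : (n_components, n_features) current principal-component estimates
    x : (n_features,) one spike waveform (mean-subtracted)
    Implements  dW = lr * (y x^T - lower_triangular(y y^T) W)."""
    y = W @ x
    W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# streaming use: spikes arrive one at a time; memory stays O(n_components * n_features)
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 32)) * 0.01          # 3 PCs of 32-sample spike waveforms
for spike in rng.standard_normal((1000, 32)):    # synthetic spikes, illustration only
    W = gha_update(W, spike)
```

Because the only state carried between spikes is W, the memory footprint is fixed, which is the property the stream-based eigenfilter exploits in hardware.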
NASA Technical Reports Server (NTRS)
Stehle, Roy H.; Ogier, Richard G.
1993-01-01
Alternatives for realizing a packet-based network switch for use on a frequency division multiple access/time division multiplexed (FDMA/TDM) geostationary communication satellite were investigated. Each of the eight downlink beams supports eight directed dwells. The design needed to accommodate multicast packets with very low probability of loss due to contention. Three switch architectures were designed and analyzed. An output-queued, shared bus system yielded a functionally simple system, utilizing a first-in, first-out (FIFO) memory per downlink dwell, but at the expense of a large total memory requirement. A shared memory architecture offered the most efficiency in memory requirements, requiring about half the memory of the shared bus design. The processing requirement for the shared-memory system adds system complexity that may offset the benefits of the smaller memory. An alternative design using a shared memory buffer per downlink beam decreases circuit complexity through a distributed design, and requires at most 1000 packets of memory more than the completely shared memory design. Modifications to the basic packet switch designs were proposed to accommodate circuit-switched traffic, which must be served on a periodic basis with minimal delay. Methods for dynamically controlling the downlink dwell lengths were developed and analyzed. These methods adapt quickly to changing traffic demands, and do not add significant complexity or cost to the satellite and ground station designs. Methods for reducing the memory requirement by not requiring the satellite to store full packets were also proposed and analyzed. In addition, optimal packet and dwell lengths were computed as functions of memory size for the three switch architectures.
High Performance Computing Assets for Ocean Acoustics Research
2016-11-18
independently on processing units with access to a typically available amount of memory, say 16 or 32 gigabytes. Our models require each processor to...allow results to be obtained with limited amounts of memory available to individual processing units (with no time frame for successful completion...put into use. One file server computer to store simulation output has also been purchased. The first workstation has 28 CPU cores, dual-thread (56
40 CFR 1033.110 - Emission diagnostics-general requirements.
Code of Federal Regulations, 2011 CFR
2011-07-01
... engine operation. (d) Record and store in computer memory any diagnostic trouble codes showing a... and understand the diagnostic trouble codes stored in the onboard computer with generic tools and...
Fault-tolerant computer study. [logic designs for building block circuits
NASA Technical Reports Server (NTRS)
Rennels, D. A.; Avizienis, A. A.; Ercegovac, M. D.
1981-01-01
A set of building block circuits is described which can be used with commercially available microprocessors and memories to implement fault-tolerant distributed computer systems. Each building block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self-checking computer module with self-contained input/output and interfaces to redundant communication buses. Fault tolerance is achieved by connecting self-checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building block circuits are discussed.
High-order hydrodynamic algorithms for exascale computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morgan, Nathaniel Ray
Hydrodynamic algorithms are at the core of many laboratory missions ranging from simulating ICF implosions to climate modeling. The hydrodynamic algorithms commonly employed at the laboratory and in industry (1) typically lack requisite accuracy for complex multi-material vortical flows and (2) are not well suited for exascale computing due to poor data locality and poor FLOP/memory ratios. Exascale computing requires advances in both computer science and numerical algorithms. We propose to research the second requirement and create a new high-order hydrodynamic algorithm that has superior accuracy, excellent data locality, and excellent FLOP/memory ratios. This proposal will impact a broad range of research areas including numerical theory, discrete mathematics, vorticity evolution, gas dynamics, interface instability evolution, turbulent flows, fluid dynamics and shock-driven flows. If successful, the proposed research has the potential to radically transform simulation capabilities and help position the laboratory for computing at the exascale.
Computational efficiency improvements for image colorization
NASA Astrophysics Data System (ADS)
Yu, Chao; Sharma, Gaurav; Aly, Hussein
2013-03-01
We propose an efficient algorithm for colorization of greyscale images. As in prior work, colorization is posed as an optimization problem: a user specifies the color for a few scribbles drawn on the greyscale image and the color image is obtained by propagating color information from the scribbles to surrounding regions, while maximizing the local smoothness of colors. In this formulation, colorization is obtained by solving a large sparse linear system, which normally requires substantial computation and memory resources. Our algorithm improves the computational performance through three innovations over prior colorization implementations. First, the linear system is solved iteratively without explicitly constructing the sparse matrix, which significantly reduces the required memory. Second, we formulate each iteration in terms of integral images obtained by dynamic programming, reducing repetitive computation. Third, we use a coarse-to-fine framework, where a lower-resolution subsampled image is first colorized and this low-resolution color image is upsampled to initialize the colorization process for the fine level. The improvements we develop provide significant speedup and memory savings compared to the conventional approach of solving the linear system directly using off-the-shelf sparse solvers, and allow us to colorize images with typical sizes encountered in realistic applications on typical commodity computing platforms.
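A matrix-free iteration of the kind described in the first innovation can be sketched as follows (our own simplified variant, not the authors' integral-image formulation): each unscribbled pixel is repeatedly replaced by an intensity-weighted average of its neighbours, so the sparse system is never built. The names, the 4-neighbour stencil and the wrap-around boundary handling are simplifying assumptions.

```python
import numpy as np

def colorize_channel(gray, chroma0, is_scribble, n_iter=200, sigma=0.05):
    """Matrix-free Jacobi-style propagation of one chrominance channel.
    gray        : (H, W) greyscale intensities in [0, 1]
    chroma0     : (H, W) chrominance, nonzero only on scribbled pixels
    is_scribble : (H, W) boolean mask of user scribbles
    Neighbour weights come from intensity similarity; scribbles stay fixed."""
    U = chroma0.copy()
    for _ in range(n_iter):
        num = np.zeros_like(U)
        den = np.zeros_like(U)
        for dy, dx in [(0, 1), (0, -1), (1, 0), (-1, 0)]:
            g = np.roll(gray, (dy, dx), axis=(0, 1))   # wrap-around boundaries (simplification)
            u = np.roll(U, (dy, dx), axis=(0, 1))
            w = np.exp(-(gray - g) ** 2 / (2 * sigma ** 2))
            num += w * u
            den += w
        U = np.where(is_scribble, chroma0, num / np.maximum(den, 1e-12))
    return U

# toy usage: 32x32 grey ramp with two scribbled pixels
gray = np.tile(np.linspace(0, 1, 32), (32, 1))
chroma0 = np.zeros((32, 32))
mask = np.zeros((32, 32), dtype=bool)
chroma0[4, 4], mask[4, 4] = 0.8, True        # "warm" scribble
chroma0[28, 28], mask[28, 28] = -0.8, True   # "cool" scribble
U = colorize_channel(gray, chroma0, mask)
```

Only a few image-sized arrays are ever held in memory, which is the point of avoiding the explicit sparse matrix; the coarse-to-fine scheme would simply run this same loop at a lower resolution first.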
DOE Office of Scientific and Technical Information (OSTI.GOV)
James, Conrad D.; Schiess, Adrian B.; Howell, Jamie
2013-10-01
The human brain (volume = 1200 cm^3) consumes 20 W and is capable of performing > 10^16 operations/s. Current supercomputer technology has reached 10^15 operations/s, yet it requires 1500 m^3 and 3 MW, giving the brain a 10^12 advantage in operations/s/W/cm^3. Thus, to reach exascale computation, two achievements are required: 1) improved understanding of computation in biological tissue, and 2) a paradigm shift towards neuromorphic computing where hardware circuits mimic properties of neural tissue. To address 1), we will interrogate corticostriatal networks in mouse brain tissue slices, specifically with regard to their frequency filtering capabilities as a function of input stimulus. To address 2), we will instantiate biological computing characteristics such as multi-bit storage into hardware devices with future computational and memory applications. Resistive memory devices will be modeled, designed, and fabricated in the MESA facility in consultation with our internal and external collaborators.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nash, T.; Atac, R.; Cook, A.
1989-03-06
The ACPMAPS multiprocessor is a highly cost-effective, local-memory parallel computer with a hypercube or compound hypercube architecture. Communication requires the attention of only the two communicating nodes. The design is aimed at floating-point-intensive, grid-like problems, particularly those with extreme computing requirements. The processing nodes of the system are single-board array processors, each with a peak power of 20 Mflops, supported by 8 Mbytes of data and 2 Mbytes of instruction memory. The system currently being assembled has a peak power of 5 Gflops. The nodes are based on the Weitek XL Chip set. The system delivers performance at approximately $300/Mflop. 8 refs., 4 figs.
Code of Federal Regulations, 2012 CFR
2012-10-01
..., the following definitions apply to this subchapter: Act means the Social Security Act. ANSI stands for... required documents. Electronic media means: (1) Electronic storage media including memory devices in computers (hard drives) and any removable/transportable digital memory medium, such as magnetic tape or disk...
Code of Federal Regulations, 2011 CFR
2011-10-01
..., the following definitions apply to this subchapter: Act means the Social Security Act. ANSI stands for... required documents. Electronic media means: (1) Electronic storage media including memory devices in computers (hard drives) and any removable/transportable digital memory medium, such as magnetic tape or disk...
Code of Federal Regulations, 2010 CFR
2010-10-01
..., the following definitions apply to this subchapter: Act means the Social Security Act. ANSI stands for... required documents. Electronic media means: (1) Electronic storage media including memory devices in computers (hard drives) and any removable/transportable digital memory medium, such as magnetic tape or disk...
Computing with volatile memristors: an application of non-pinched hysteresis
NASA Astrophysics Data System (ADS)
Pershin, Y. V.; Shevchenko, S. N.
2017-02-01
The possibility of in-memory computing with volatile memristive devices, namely, memristors requiring a power source to sustain their memory, is demonstrated theoretically. We have adopted a hysteretic graphene-based field emission structure as a prototype of a volatile memristor, which is characterized by a non-pinched hysteresis loop. A memristive model of the structure is developed and used to simulate a polymorphic circuit implementing stateful logic gates, such as the material implication. Specific regions of parameter space realizing useful logic functions are identified. Our results are applicable to other realizations of volatile memory devices, such as certain NEMS switches.
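Material implication is the stateful primitive referred to above: the next state written into a target device q depends jointly on its own current state and on the state of a second device p, q' = (NOT p) OR q, so the memory cell itself performs the logic. A behavioural, truth-table-level sketch (abstracting away the memristive physics entirely) also shows the classic construction of NAND from two IMP steps and a cleared work cell:

```python
def imp(p: bool, q: bool) -> bool:
    """Stateful material implication: the new state written into device q."""
    return (not p) or q

def nand_via_imp(p: bool, q: bool) -> bool:
    """Classic construction NAND(p, q) = p IMP (q IMP 0), showing that IMP plus
    a cleared work cell is functionally complete."""
    s = imp(q, False)   # s <- q IMP 0, i.e. NOT q, using a cleared work cell
    return imp(p, s)    # q' <- p IMP s = NAND(p, q)

assert all(nand_via_imp(p, q) == (not (p and q))
           for p in (False, True) for q in (False, True))
```

In the memristive circuit, the same truth table emerges from how the applied voltages and the devices' current resistive states determine whether the target device switches.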
Power and Performance Trade-offs for Space Time Adaptive Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino
Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP’s computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementation on the Haswell CPU architecture. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.
Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors
NASA Astrophysics Data System (ADS)
Sourouri, Mohammed; Birger Raknes, Espen
2017-04-01
In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI), the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE as this will have a major impact on the total computational cost in running RTM or FWI. From a computational point of view, applications implementing EWEs are associated with two major challenges. The first challenge is the amount of memory-bound computations involved, while the second challenge is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite their architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory, which also requires explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor based on the Knights Landing (KNL) architecture promises the computational capabilities of an accelerator but requires the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (the number of cores, tiles and memory varies with the model) organized in tiles that are connected via a 2D mesh-based interconnect. In contrast to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required. However, like most accelerators, KNL sports a memory subsystem consisting of low-level caches and 16 GB of high-bandwidth MCDRAM memory. For capacity computing, up to 400 GB of conventional DDR4 memory is provided. Such a strict hierarchical memory layout means that data locality is imperative if the true potential of this product is to be harnessed. In this work, we study a series of optimizations specifically targeting KNL for our EWE-based application to reduce the time-to-solution for the following 3D model sizes in grid points: 128^3, 256^3 and 512^3. We compare the results with an optimized version for multi-core CPUs running on a dual-socket Xeon E5 2680v3 system using OpenMP. Our initial naive implementation on the KNL is roughly 20% faster than the multi-core version, but by using only one thread per core and careful memory placement using the memkind library, we could achieve higher speedups. Additionally, by using the MCDRAM as cache for problem sizes that are smaller than 16 GB, further performance improvements were unlocked. Depending on the problem size, our overall results indicate that the KNL-based system is approximately 2.2x faster than the 24-core Xeon E5 2680v3 system, with only modest changes to the code.
Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors
NASA Technical Reports Server (NTRS)
Probst, David K.
1993-01-01
A scalable solution to the memory-latency problem is necessary to prevent the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from reducing high performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include: processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include: vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.
Distributed simulation using a real-time shared memory network
NASA Technical Reports Server (NTRS)
Simon, Donald L.; Mattern, Duane L.; Wong, Edmond; Musgrave, Jeffrey L.
1993-01-01
The Advanced Control Technology Branch of the NASA Lewis Research Center performs research in the area of advanced digital controls for aeronautic and space propulsion systems. This work requires the real-time implementation of both control software and complex dynamical models of the propulsion system. We are implementing these systems in a distributed, multi-vendor computer environment. Therefore, a need exists for real-time communication and synchronization between the distributed multi-vendor computers. A shared memory network is a potential solution which offers several advantages over other real-time communication approaches. A candidate shared memory network was tested for basic performance. The shared memory network was then used to implement a distributed simulation of a ramjet engine. The accuracy and execution time of the distributed simulation was measured and compared to the performance of the non-partitioned simulation. The ease of partitioning the simulation, the minimal time required to develop for communication between the processors and the resulting execution time all indicate that the shared memory network is a real-time communication technique worthy of serious consideration.
NASA Technical Reports Server (NTRS)
Denning, Peter J.
1988-01-01
The worm, Trojan horse, bacterium, and virus are destructive programs that attack information stored in a computer's memory. Virus programs, which propagate by incorporating copies of themselves into other programs, are a growing menace in the late-1980s world of unprotected, networked workstations and personal computers. Limited immunity is offered by memory protection hardware, digitally authenticated object programs, and antibody programs that kill specific viruses. Additional immunity can be gained from the practice of digital hygiene, primarily the refusal to use software from untrusted sources. Full immunity requires attention in a social dimension, the accountability of programmers.
2014-01-01
Background Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satisfiable with simple off-the-shelf computers. Results We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins. Conclusions The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines. PMID:24843788
Exploiting short-term memory in soft body dynamics as a computational resource
Nakajima, K.; Li, T.; Hauser, H.; Pfeifer, R.
2014-01-01
Soft materials are not only highly deformable, but they also possess rich and diverse body dynamics. Soft body dynamics exhibit a variety of properties, including nonlinearity, elasticity and potentially infinitely many degrees of freedom. Here, we demonstrate that such soft body dynamics can be employed to conduct certain types of computation. Using body dynamics generated from a soft silicone arm, we show that they can be exploited to emulate functions that require memory and to embed robust closed-loop control into the arm. Our results suggest that soft body dynamics have a short-term memory and can serve as a computational resource. This finding paves the way towards exploiting passive body dynamics for control of a large class of underactuated systems. PMID:25185579
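The link between passive body dynamics and computation parallels reservoir computing: drive fixed nonlinear dynamics with an input stream and train only a linear readout on the observed states. The following software analogue (an echo-state-style reservoir standing in for the silicone arm; all parameters and names are ours) emulates a delayed-recall task, i.e. a function that requires short-term memory:

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, T, delay = 100, 2000, 3            # reservoir size, stream length, recall delay
W_in = rng.uniform(-0.5, 0.5, n_res)
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius below 1 (fading memory)

u = rng.uniform(-1, 1, T)                 # input stream driving the "body"
x = np.zeros(n_res)
states = np.empty((T, n_res))
for t in range(T):                        # fixed, untrained nonlinear dynamics
    x = np.tanh(W @ x + W_in * u[t])
    states[t] = x

target = np.roll(u, delay)                # task: recall the input from `delay` steps ago
X, y = states[delay:], target[delay:]
w_out = np.linalg.lstsq(X, y, rcond=None)[0]      # train only the linear readout
print("recall correlation:", np.corrcoef(X @ w_out, y)[0, 1])
```

The dynamics themselves are never trained; the readout succeeds only because the reservoir's state retains a trace of recent inputs, which is the same role the soft arm's transient body dynamics play in the paper.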
Program Design for Retrospective Searches on Large Data Bases
ERIC Educational Resources Information Center
Thiel, L. H.; Heaps, H. S.
1972-01-01
Retrospective search of large data bases requires development of special techniques for automatic compression of data and minimization of the number of input-output operations to the computer files. The computer program should require a relatively small amount of internal memory. This paper describes the structure of such a program. (9 references)…
Supercomputer requirements for selected disciplines important to aerospace
NASA Technical Reports Server (NTRS)
Peterson, Victor L.; Kim, John; Holst, Terry L.; Deiwert, George S.; Cooper, David M.; Watson, Andrew B.; Bailey, F. Ron
1989-01-01
Speed and memory requirements placed on supercomputers by five different disciplines important to aerospace are discussed and compared with the capabilities of various existing computers and those projected to be available before the end of this century. The disciplines chosen for consideration are turbulence physics, aerodynamics, aerothermodynamics, chemistry, and human vision modeling. Example results for problems illustrative of those currently being solved in each of the disciplines are presented and discussed. Limitations imposed on physical modeling and geometrical complexity by the need to obtain solutions in practical amounts of time are identified. Computational challenges for the future, for which either some or all of the current limitations are removed, are described. Meeting some of the challenges will require computer speeds in excess of exaflop/s (10 to the 18th flop/s) and memories in excess of petawords (10 to the 15th words).
Cost aware cache replacement policy in shared last-level cache for hybrid memory based fog computing
NASA Astrophysics Data System (ADS)
Jia, Gangyong; Han, Guangjie; Wang, Hao; Wang, Feng
2018-04-01
Fog computing requires a large main memory capacity to decrease latency and increase the Quality of Service (QoS). However, dynamic random access memory (DRAM), the commonly used random access memory, cannot be incorporated into a fog computing system due to its high power consumption. In recent years, non-volatile memories (NVM) such as Phase-Change Memory (PCM) and Spin-transfer torque RAM (STT-RAM), with their low power consumption, have emerged to replace DRAM. Moreover, the currently proposed hybrid main memory, consisting of both DRAM and NVM, has shown promising advantages in terms of scalability and power consumption. However, the drawbacks of NVM, such as long read/write latency, give rise to asymmetric cache miss costs in the hybrid main memory. Current last-level cache (LLC) policies are based on a uniform miss cost, and so result in poor LLC performance and add to the cost of using NVM. In order to minimize the cache miss cost in the hybrid main memory, we propose a cost-aware cache replacement policy (CACRP) that reduces the number of cache misses from NVM and improves the cache performance for a hybrid memory system. Experimental results show that our CACRP improves LLC performance by up to 43.6% (15.5% on average) compared to LRU.
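The intuition behind a cost-aware policy is that, in a DRAM/NVM hybrid, a miss refilled from NVM is far more expensive than one refilled from DRAM, so the victim should be chosen by weighing recency against refill cost rather than by recency alone. A toy policy in that spirit (illustrative only, not the CACRP algorithm itself; the class and cost values are ours):

```python
class CostAwareCache:
    """Toy fully-associative cache: the victim is the block with the largest
    age / refill_cost ratio, so old blocks are evicted first, but blocks that
    are expensive to refill (NVM-backed) are kept longer than cheap (DRAM) ones."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = 0
        self.blocks = {}                       # tag -> (refill_cost, last_access)

    def access(self, tag, refill_cost):
        self.clock += 1
        if tag not in self.blocks and len(self.blocks) >= self.capacity:
            victim = max(self.blocks,
                         key=lambda t: (self.clock - self.blocks[t][1])
                                        / self.blocks[t][0])
            del self.blocks[victim]
        self.blocks[tag] = (refill_cost, self.clock)

cache = CostAwareCache(capacity=4)
for tag, cost in [("a", 1), ("b", 5), ("c", 1), ("d", 5), ("e", 1)]:
    cache.access(tag, cost)                    # cost 5 ~ NVM-backed, cost 1 ~ DRAM-backed
```

A plain LRU would evict whatever block is oldest regardless of where it must be refilled from; weighting by refill cost biases eviction toward DRAM-backed lines, which is the asymmetry the abstract describes.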
Protein sequence comparison based on K-string dictionary.
Yu, Chenglong; He, Rong L; Yau, Stephen S-T
2013-10-25
The current K-string-based protein sequence comparisons require large amounts of computer memory because the dimension of the protein vector representation grows exponentially with K. In this paper, we propose a novel concept, the "K-string dictionary", to solve this high-dimensional problem. It allows us to use a much lower dimensional K-string-based frequency or probability vector to represent a protein, and thus significantly reduce the computer memory requirements for their implementation. Furthermore, based on this new concept, we use Singular Value Decomposition to analyze real protein datasets, and the improved protein vector representation allows us to obtain accurate gene trees. © 2013.
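The K-string dictionary amounts to indexing only the K-strings that actually occur in the data set, so each protein is represented by a frequency vector over that much smaller dictionary rather than over all 20^K possible K-strings, and an SVD of the resulting matrix then gives a compact representation for downstream tree building. A schematic version (our own illustration, not the authors' code):

```python
import numpy as np

def kstring_vectors(sequences, K=3):
    """Frequency vectors over the K-string dictionary: the union of K-strings
    that actually occur in the data, instead of all 20**K possibilities."""
    index = {}                                  # the K-string dictionary
    rows = []
    for seq in sequences:
        counts = {}
        for i in range(len(seq) - K + 1):
            kmer = seq[i:i + K]
            index.setdefault(kmer, len(index))
            counts[kmer] = counts.get(kmer, 0) + 1
        rows.append(counts)
    M = np.zeros((len(sequences), len(index)))
    for r, counts in enumerate(rows):
        for kmer, c in counts.items():
            M[r, index[kmer]] = c / max(len(sequences[r]) - K + 1, 1)
    return M, index

seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GSHMKTAYIA"]          # toy protein fragments
M, dictionary = kstring_vectors(seqs, K=3)
U, S, Vt = np.linalg.svd(M, full_matrices=False)           # low-rank representation
embedding = U * S                                           # rows: proteins in reduced space
```

The memory saving is in the width of M: it grows with the number of distinct K-strings actually observed, not with 20^K.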
Projected phase-change memory devices.
Koelmans, Wabe W; Sebastian, Abu; Jonnalagadda, Vara Prasad; Krebs, Daniel; Dellmann, Laurent; Eleftheriou, Evangelos
2015-09-03
Nanoscale memory devices, whose resistance depends on the history of the electric signals applied, could become critical building blocks in new computing paradigms, such as brain-inspired computing and memcomputing. However, there are key challenges to overcome, such as the high programming power required, noise and resistance drift. Here, to address these, we present the concept of a projected memory device, whose distinguishing feature is that the physical mechanism of resistance storage is decoupled from the information-retrieval process. We designed and fabricated projected memory devices based on the phase-change storage mechanism and convincingly demonstrate the concept through detailed experimentation, supported by extensive modelling and finite-element simulations. The projected memory devices exhibit remarkably low drift and excellent noise performance. We also demonstrate active control and customization of the programming characteristics of the device that reliably realize a multitude of resistance states.
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Biswas, Rupak; Simon, Horst D.
1996-01-01
The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.
Overset grid applications on distributed memory MIMD computers
NASA Technical Reports Server (NTRS)
Chawla, Kalpana; Weeratunga, Sisira
1994-01-01
Analysis of modern aerospace vehicles requires the computation of flowfields about complex three dimensional geometries composed of regions with varying spatial resolution requirements. Overset grid methods allow the use of proven structured grid flow solvers to address the twin issues of geometrical complexity and the resolution variation by decomposing the complex physical domain into a collection of overlapping subdomains. This flexibility is accompanied by the need for irregular intergrid boundary communication among the overlapping component grids. This study investigates a strategy for implementing such a static overset grid implicit flow solver on distributed memory, MIMD computers; i.e., the 128 node Intel iPSC/860 and the 208 node Intel Paragon. Performance data for two composite grid configurations characteristic of those encountered in present day aerodynamic analysis are also presented.
ERIC Educational Resources Information Center
Lazzaro, Joseph J.
1993-01-01
Describes adaptive technology for personal computers that accommodate disabled users and may require special equipment including hardware, memory, expansion slots, and ports. Highlights include vision aids, including speech synthesizers, magnification, braille, and optical character recognition (OCR); hearing adaptations; motor-impaired…
Noise reduction in optically controlled quantum memory
NASA Astrophysics Data System (ADS)
Ma, Lijun; Slattery, Oliver; Tang, Xiao
2018-05-01
Quantum memory is an essential tool for quantum communications systems and quantum computers. An important category of quantum memory, called optically controlled quantum memory, uses a strong classical beam to control the storage and re-emission of a single-photon signal through an atomic ensemble. In this type of memory, the residual light from the strong classical control beam can cause severe noise and degrade the system performance significantly. Efficiently suppressing this noise is a requirement for the successful implementation of optically controlled quantum memories. In this paper, we briefly introduce the latest and most common approaches to quantum memory and review the various noise-reduction techniques used in implementing them.
Automated quantitative muscle biopsy analysis system
NASA Technical Reports Server (NTRS)
Castleman, Kenneth R. (Inventor)
1980-01-01
An automated system to aid the diagnosis of neuromuscular diseases by producing fiber size histograms utilizing histochemically stained muscle biopsy tissue. Televised images of the microscopic fibers are processed electronically by a multi-microprocessor computer, which isolates, measures, and classifies the fibers and displays the fiber size distribution. The architecture of the multi-microprocessor computer, which is iterated to any required degree of complexity, features a series of individual microprocessors P_n, each receiving data from a shared memory M_(n-1) and outputting processed data to a separate shared memory M_(n+1), under control of a program stored in dedicated memory M_n.
In-situ, In-Memory Stateful Vector Logic Operations based on Voltage Controlled Magnetic Anisotropy.
Jaiswal, Akhilesh; Agrawal, Amogh; Roy, Kaushik
2018-04-10
Recently, the exponential increase in compute requirements demanded by emerging applications like artificial intelligence, the Internet of things, etc., has rendered state-of-the-art von Neumann machines inefficient in terms of energy and throughput owing to the well-known von Neumann bottleneck. A promising approach to mitigate the bottleneck is to do computations as close to the memory units as possible. One extreme possibility is to do in-situ Boolean logic computations by using stateful devices. Stateful devices are those that can act both as a compute engine and a storage device, simultaneously. We propose such stateful, vector, in-memory operations using the voltage-controlled magnetic anisotropy (VCMA) effect in magnetic tunnel junctions (MTJs). Our proposal is based on the well-known, manufacturable 1-transistor-1-MTJ bit-cell and does not require any modifications in the bit-cell circuit or the magnetic device. Instead, we leverage the very physics of the VCMA effect to enable stateful computations. Specifically, we exploit the voltage asymmetry of the VCMA effect to construct a stateful IMP (implication) gate and use the precessional switching dynamics of the VCMA devices to propose a massively parallel NOT operation. Further, we show that other gates like AND, OR, NAND, NOR, NIMP (complement of implication) can be implemented using multi-cycle operations.
AFTOMS Technology Issues and Alternatives Report
1989-12-01
(color, resolution; memory, processor speed; LAN interfaces, etc.; power requirements; physical and weather ruggedness) of these display...Telephone and Telegraph; CD-I Compact Disk-Interactive; CD-ROM Compact Disk-Read Only Memory; CGM Computer Graphics Metafile; CNWDI Critical Nuclear...Database Management System; RFP Request For Proposal; RFS Remote File System; ROM Read Only Memory; SA-ALC San Antonio Air Logistics Center; SAC
LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly.
Bonizzoni, Paola; Vedova, Gianluca Della; Pirola, Yuri; Previtali, Marco; Rizzi, Raffaella
2016-03-01
The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external-memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT). We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index that is exploited so that the main memory requirement is independent of the size of the data set. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data. LSG is open-source software and is available online. We have analyzed our implementation on an 875-million-read whole-genome dataset, on which LSG has built the string graph using only 1 GB of main memory (reducing the memory occupation by a factor of 50 with respect to SGA), while requiring slightly more than twice the time of SGA. The analysis of the entire pipeline shows an important decrease in memory usage, while managing to have only a moderate increase in the running time.
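The BWT underlying the FM-index is easy to state even though the external-memory construction used here is not: append a sentinel, sort all rotations of the text, and keep the last column. Below is a tiny in-memory illustration only, nothing like the disk-based BEETL/LSG pipeline itself.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via explicit rotation sorting.
    Fine for illustration only; real NGS indexing (BEETL, FM-index) builds the
    transform on disk without ever materialising the rotation matrix."""
    s = text + "$"                        # unique sentinel, lexicographically smallest
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("ACGTACGT"))                    # last column of the sorted rotation matrix
```

The clustering of identical characters in the last column is what makes the transform compressible and rank-queryable, which in turn is what lets LSG answer the overlap queries needed for the string graph with a memory footprint independent of the read-set size.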
Sando, Yusuke; Barada, Daisuke; Jackin, Boaz Jessie; Yatagai, Toyohiko
2017-07-10
This study proposes a method to reduce the calculation time and memory usage required for calculating cylindrical computer-generated holograms. The wavefront on the cylindrical observation surface is represented as a convolution integral in the 3D Fourier domain. The Fourier transformation of the kernel function involving this convolution integral is analytically performed using a Bessel function expansion. The analytical solution can drastically reduce the calculation time and the memory usage without any cost, compared with the numerical method using fast Fourier transform to Fourier transform the kernel function. In this study, we present the analytical derivation, the efficient calculation of Bessel function series, and a numerical simulation. Furthermore, we demonstrate the effectiveness of the analytical solution through comparisons of calculation time and memory usage.
Improving Memory Error Handling Using Linux
DOE Office of Scientific and Technical Information (OSTI.GOV)
Carlton, Michael Andrew; Blanchard, Sean P.; Debardeleben, Nathan A.
As supercomputers continue to get faster and more powerful in the future, they will also have more nodes. If nothing is done, then the amount of memory in supercomputer clusters will soon grow large enough that memory failures will be unmanageable to deal with by manually replacing memory DIMMs. "Improving Memory Error Handling Using Linux" is a process-oriented method to solve this problem by using the Linux kernel to disable (offline) faulty memory pages containing bad addresses, preventing them from being used again by a process. The process of offlining memory pages simplifies error handling and results in reducing both hardware and manpower costs required to run Los Alamos National Laboratory (LANL) clusters. This process will be necessary for the future of supercomputing to allow the development of exascale computers. It will not be feasible without memory error handling to manually replace the number of DIMMs that will fail daily on a machine consisting of 32-128 petabytes of memory. Testing reveals the process of offlining memory pages works and is relatively simple to use. As more and more testing is conducted, the entire process will be automated within the high-performance computing (HPC) monitoring software, Zenoss, at LANL.
A fast sequence assembly method based on compressed data structures.
Liang, Peifeng; Zhang, Yancong; Lin, Kui; Hu, Jinglu
2014-01-01
Assembling a large genome using next-generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, a memory- and time-efficient assembler, called FMJ-Assembler, is presented by applying an FM-index within JR-Assembler, where FM stands for the FMR-index derived from the FM-index and BWT, and J for jumping extension. The FMJ-Assembler uses an expanded FM-index and BWT to compress the read data to save memory, and the jumping extension method makes it faster in CPU time. An extensive comparison of the FMJ-Assembler with current assemblers shows that it achieves a better or comparable overall assembly quality while requiring less memory and less CPU time. All these advantages indicate that the FMJ-Assembler will be an efficient assembly method for next-generation sequencing technology.
Tian, He; Zhao, Lianfeng; Wang, Xuefeng; Yeh, Yao-Wen; Yao, Nan; Rand, Barry P; Ren, Tian-Ling
2017-12-26
Extremely low energy consumption neuromorphic computing is required to achieve massively parallel information processing on par with the human brain. To achieve this goal, resistive memories based on materials with ionic transport and extremely low operating current are required. Extremely low operating current allows for low power operation by minimizing the program, erase, and read currents. However, materials currently used in resistive memories, such as defective HfOx, AlOx, TaOx, etc., cannot suppress electronic transport (i.e., leakage current) while allowing good ionic transport. Here, we show that 2D Ruddlesden-Popper phase hybrid lead bromide perovskite single crystals are promising materials for low operating current nanodevice applications because of their mixed electronic and ionic transport and ease of fabrication. Ionic transport in the exfoliated 2D perovskite layer is evident via the migration of bromide ions. Filaments with a diameter of approximately 20 nm are visualized, and resistive memories with extremely low program current down to 10 pA are achieved, a value at least 1 order of magnitude lower than conventional materials. The ionic migration and diffusion as an artificial synapse is realized in the 2D layered perovskites at the pA level, which can enable extremely low energy neuromorphic computing.
Hardware implementation of CMAC neural network with reduced storage requirement.
Ker, J S; Kuo, Y H; Wen, R C; Liu, B D
1997-01-01
The cerebellar model articulation controller (CMAC) neural network has the advantages of fast convergence speed and low computation complexity. However, it suffers from a low storage space utilization rate on weight memory. In this paper, we propose a direct weight address mapping approach, which can reduce the required weight memory size with a utilization rate near 100%. Based on such an address mapping approach, we developed a pipeline architecture to efficiently perform the addressing operations. The proposed direct weight address mapping approach also speeds up the computation for the generation of weight addresses. Besides, a CMAC hardware prototype used for color calibration has been implemented to confirm the proposed approach and architecture.
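The abstract does not spell out the direct address mapping itself, so the sketch below shows only a generic tile-coding CMAC in Python with a conventional dense weight table per tiling; the class layout, parameters, and the toy training target are assumptions for illustration, not the proposed mapping or hardware pipeline.

# Generic tile-coding CMAC sketch: quantized inputs in several offset tilings
# each select one weight; the output is the sum of the selected weights.
import numpy as np

class CMAC:
    def __init__(self, n_tilings=8, tiles_per_dim=16, n_dims=2, lr=0.1):
        self.n_tilings = n_tilings
        self.tiles = tiles_per_dim
        self.lr = lr
        # One dense weight table per tiling; the addressing scheme determines
        # how much of this table is actually used.
        self.w = np.zeros((n_tilings, tiles_per_dim ** n_dims))

    def _addresses(self, x):
        # x is assumed normalized to [0, 1) in each dimension.
        addrs = []
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.tiles)
            idx = np.floor((np.asarray(x) + offset) * self.tiles).astype(int)
            idx = np.clip(idx, 0, self.tiles - 1)
            addrs.append(int(np.ravel_multi_index(tuple(idx), (self.tiles,) * len(idx))))
        return addrs

    def predict(self, x):
        return sum(self.w[t, a] for t, a in enumerate(self._addresses(x)))

    def train(self, x, target):
        err = target - self.predict(x)
        for t, a in enumerate(self._addresses(x)):
            self.w[t, a] += self.lr * err / self.n_tilings

cmac = CMAC()
for _ in range(5000):
    x = np.random.rand(2)
    cmac.train(x, np.sin(2 * np.pi * x[0]) * x[1])
print(cmac.predict([0.25, 0.5]))   # should approach sin(pi/2) * 0.5 = 0.5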
Programmable data communications controller requirements
NASA Technical Reports Server (NTRS)
1977-01-01
The design requirements for a Programmable Data Communications Controller (PDCC) that reduces the difficulties in attaching data terminal equipment to a computer are presented. The PDCC is an interface between the computer I/O channel and the bit serial communication lines. Each communication line is supported by a communication port that handles all line control functions and performs most terminal control functions. The port is fabricated on a printed circuit board that plugs into a card chassis, mating with a connector that is joined to all other card stations by a data bus. Ports are individually programmable; each includes a microprocessor, a programmable read-only memory for instruction storage, and a random access memory for data storage.
Distributed Memory Parallel Computing with SEAWAT
NASA Astrophysics Data System (ADS)
Verkaik, J.; Huizer, S.; van Engelen, J.; Oude Essink, G.; Ram, R.; Vuik, K.
2017-12-01
Fresh groundwater reserves in coastal aquifers are threatened by sea-level rise, extreme weather conditions, increasing urbanization and associated groundwater extraction rates. To counteract these threats, accurate high-resolution numerical models are required to optimize the management of these precious reserves. The major model drawbacks are long run times and large memory requirements, limiting the predictive power of these models. Distributed memory parallel computing is an efficient technique for reducing run times and memory requirements, where the problem is divided over multiple processor cores. A new Parallel Krylov Solver (PKS) for SEAWAT is presented. PKS has recently been applied to MODFLOW and includes Conjugate Gradient (CG) and Biconjugate Gradient Stabilized (BiCGSTAB) linear accelerators. Both accelerators are preconditioned by an overlapping additive Schwarz preconditioner in a way that: a) subdomains are partitioned using Recursive Coordinate Bisection (RCB) load balancing, b) each subdomain uses local memory only and communicates with other subdomains by Message Passing Interface (MPI) within the linear accelerator, c) it is fully integrated in SEAWAT. Within SEAWAT, the PKS-CG solver replaces the Preconditioned Conjugate Gradient (PCG) solver for solving the variable-density groundwater flow equation and the PKS-BiCGSTAB solver replaces the Generalized Conjugate Gradient (GCG) solver for solving the advection-diffusion equation. PKS supports the third-order Total Variation Diminishing (TVD) scheme for computing advection. Benchmarks were performed on the Dutch national supercomputer (https://userinfo.surfsara.nl/systems/cartesius) using up to 128 cores, for a synthetic 3D Henry model (100 million cells) and the real-life Sand Engine model (~10 million cells). The Sand Engine model was used to investigate the potential effect of the long-term morphological evolution of a large sand replenishment and climate change on fresh groundwater resources. Speed-ups up to 40 were obtained with the new PKS solver.
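To make the solver structure concrete, here is a serial Python sketch of conjugate gradients with a one-level overlapping additive Schwarz preconditioner on a 1D Poisson matrix; it mirrors the mathematical structure described above but omits RCB partitioning, MPI communication, and everything SEAWAT-specific, and all names and sizes are illustrative assumptions.

# Serial sketch of CG with a one-level overlapping additive Schwarz
# preconditioner, mimicking a domain-decomposed Krylov solve on one process.
import numpy as np

def additive_schwarz_preconditioner(A, n_subdomains, overlap=1):
    n = A.shape[0]
    bounds = np.linspace(0, n, n_subdomains + 1, dtype=int)
    blocks = []
    for s in range(n_subdomains):
        lo = max(bounds[s] - overlap, 0)
        hi = min(bounds[s + 1] + overlap, n)
        idx = np.arange(lo, hi)
        blocks.append((idx, np.linalg.inv(A[np.ix_(idx, idx)])))
    def apply(r):
        z = np.zeros_like(r)
        for idx, Ainv in blocks:          # sum of local subdomain solves
            z[idx] += Ainv @ r[idx]
        return z
    return apply

def pcg(A, b, M, tol=1e-10, maxit=500):
    x = np.zeros_like(b)
    r = b - A @ x
    z = M(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# 1D Poisson test problem
n = 200
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
M = additive_schwarz_preconditioner(A, n_subdomains=8, overlap=2)
x = pcg(A, b, M)
print(np.linalg.norm(A @ x - b))       # residual norm, near machine precision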
Klooster, Nathaniel B.; Cook, Susan W.; Uc, Ergun Y.; Duff, Melissa C.
2015-01-01
Hand gesture, a ubiquitous feature of human interaction, facilitates communication. Gesture also facilitates new learning, benefiting speakers and listeners alike. Thus, gestures must impact cognition beyond simply supporting the expression of already-formed ideas. However, the cognitive and neural mechanisms supporting the effects of gesture on learning and memory are largely unknown. We hypothesized that gesture's ability to drive new learning is supported by procedural memory and that procedural memory deficits will disrupt gesture production and comprehension. We tested this proposal in patients with intact declarative memory, but impaired procedural memory as a consequence of Parkinson's disease (PD), and healthy comparison participants with intact declarative and procedural memory. In separate experiments, we manipulated the gestures participants saw and produced in a Tower of Hanoi (TOH) paradigm. In the first experiment, participants solved the task either on a physical board, requiring high arching movements to manipulate the discs from peg to peg, or on a computer, requiring only flat, sideways movements of the mouse. When explaining the task, healthy participants with intact procedural memory displayed evidence of their previous experience in their gestures, producing higher, more arching hand gestures after solving on a physical board, and smaller, flatter gestures after solving on a computer. In the second experiment, healthy participants who saw high arching hand gestures in an explanation prior to solving the task subsequently moved the mouse with significantly higher curvature than those who saw smaller, flatter gestures prior to solving the task. These patterns were absent in both gesture production and comprehension experiments in patients with procedural memory impairment. These findings suggest that the procedural memory system supports the ability of gesture to drive new learning. PMID:25628556
MPEG-1 low-cost encoder solution
NASA Astrophysics Data System (ADS)
Grueger, Klaus; Schirrmeister, Frank; Filor, Lutz; von Reventlow, Christian; Schneider, Ulrich; Mueller, Gerriet; Sefzik, Nicolai; Fiedrich, Sven
1995-02-01
A solution for real-time compression of digital YCRCB video data to an MPEG-1 video data stream has been developed. As an additional option, motion JPEG and video telephone streams (H.261) can be generated. For MPEG-1, up to two bidirectional predicted images are supported. The required computational power for motion estimation and DCT/IDCT, the memory size, and the memory bandwidth have been the main challenges. The design uses fast-page-mode memory accesses and requires only a single 80 ns EDO-DRAM with 256 X 16 organization for video encoding. This can be achieved only by using adequate access and coding strategies. The architecture consists of an input processing and filter unit, a memory interface, a motion estimation unit, a motion compensation unit, a DCT unit, a quantization control, a VLC unit and a bus interface. To share the available memory bandwidth among the processing tasks, a fixed schedule for memory accesses has been applied, which can be interrupted for asynchronous events. The motion estimation unit implements a highly sophisticated hierarchical search strategy based on block matching. The DCT unit uses a separated fast-DCT flowgraph realized by a switchable hardware unit for both DCT and IDCT operation. By appropriate multiplexing, only one multiplier is required for DCT, quantization, inverse quantization, and IDCT. The VLC unit generates the video stream up to the video sequence layer and is directly coupled with an intelligent bus interface. Thus, the assembly of video, audio and system data can easily be performed by the host computer. Having a relatively low complexity and only small requirements for DRAM circuits, the developed solution can be applied to low-cost encoding products for consumer electronics.
Signal and noise extraction from analog memory elements for neuromorphic computing.
Gong, N; Idé, T; Kim, S; Boybat, I; Sebastian, A; Narayanan, V; Ando, T
2018-05-29
Dense crossbar arrays of non-volatile memory (NVM) can potentially enable massively parallel and highly energy-efficient neuromorphic computing systems. The key requirements for the NVM elements are continuous (analog-like) conductance tuning capability and switching symmetry with acceptable noise levels. However, most NVM devices show non-linear and asymmetric switching behaviors. Such non-linear behaviors render separation of signal and noise extremely difficult with conventional characterization techniques. In this study, we establish a practical methodology based on Gaussian process regression to address this issue. The methodology is agnostic to switching mechanisms and applicable to various NVM devices. We show the tradeoff between switching symmetry and signal-to-noise ratio for HfO2-based resistive random access memory. Then, we characterize 1000 phase-change memory devices based on Ge2Sb2Te5 and separate the total variability into device-to-device variability and inherent randomness from individual devices. These results highlight the usefulness of our methodology to realize ideal NVM devices for neuromorphic computing.
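A minimal sketch of the kind of regression involved, using scikit-learn on synthetic conductance-update data: an RBF kernel captures the smooth switching response while a white-noise kernel absorbs the read noise. The saturating response curve, noise level, and kernel hyperparameters are invented for illustration and are not the paper's measurements or modeling choices.

# Sketch: separating the smooth switching response of an analog memory device
# from read noise with Gaussian process regression (scikit-learn).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic conductance-vs-pulse-number data: a saturating update curve plus noise.
pulses = np.arange(0, 100).reshape(-1, 1)
true_signal = 1.0 - np.exp(-pulses.ravel() / 30.0)   # assumed device response
noise_level = 0.05
g_measured = true_signal + noise_level * np.random.randn(pulses.size)

# RBF captures the smooth signal; WhiteKernel absorbs the noise variance.
kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(pulses, g_measured)

signal_estimate, signal_std = gpr.predict(pulses, return_std=True)
noise_estimate = g_measured - signal_estimate
print("fitted kernel:", gpr.kernel_)
print("estimated noise std:", noise_estimate.std())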
ERIC Educational Resources Information Center
Zambon, Franco
This study sought to determine a useful frequency for refreshing students' memories of complex procedures that involved a formal computer language. Students were required to execute the Microsoft Disc Operating System (MS-DOS) commands for "copy,""backup," and "restore." A total of 126 college students enrolled in six…
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Gaeke, Brian R.; Husbands, Parry; Li, Xiaoye S.; Oliker, Leonid; Yelick, Katherine A.; Biegel, Bryan (Technical Monitor)
2002-01-01
The increasing gap between processor and memory performance has led to new architectural models for memory-intensive applications. In this paper, we explore the performance of a set of memory-intensive benchmarks and use them to compare the performance of conventional cache-based microprocessors to a mixed logic and DRAM processor called VIRAM. The benchmarks are based on problem statements, rather than specific implementations, and in each case we explore the fundamental hardware requirements of the problem, as well as alternative algorithms and data structures that can help expose fine-grained parallelism or simplify memory access patterns. The benchmarks are characterized by their memory access patterns, their basic control structures, and the ratio of computation to memory operation.
Helicopter In-Flight Monitoring System Second Generation (HIMS II).
1983-08-01
acquisition cycle. B. Computer Chassis CPU (DEC LSI-II/2) -- Executes instructions contained in the memory. 32K memory (DEC MSVII-DD) --Contains program...when the operator executes command #2, 3, or 5 (display data). New cartridges can be inserted as required for truly unlimited, continuous data...is called bootstrapping. The software, which is stored on a tape cartridge, is loaded into memory by execution of a small program stored in read-only
Exploiting short-term memory in soft body dynamics as a computational resource.
Nakajima, K; Li, T; Hauser, H; Pfeifer, R
2014-11-06
Soft materials are not only highly deformable, but they also possess rich and diverse body dynamics. Soft body dynamics exhibit a variety of properties, including nonlinearity, elasticity and potentially infinitely many degrees of freedom. Here, we demonstrate that such soft body dynamics can be employed to conduct certain types of computation. Using body dynamics generated from a soft silicone arm, we show that they can be exploited to emulate functions that require memory and to embed robust closed-loop control into the arm. Our results suggest that soft body dynamics have a short-term memory and can serve as a computational resource. This finding paves the way towards exploiting passive body dynamics for control of a large class of underactuated systems. © 2014 The Author(s) Published by the Royal Society. All rights reserved.
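A hedged analogue in simulation: the reservoir-computing sketch below replaces the silicone arm with a random recurrent network whose fading dynamics play the role of the body's short-term memory, and trains a linear readout to recall a delayed input. The network size, delay task, and ridge parameter are assumptions, not the paper's experimental setup.

# Echo-state-style sketch: a random dynamical "body" with fading memory is
# driven by an input stream, and a linear readout trained by ridge regression
# emulates a target function that needs memory (here, a delayed input).
import numpy as np

rng = np.random.default_rng(0)
n_res, T, delay = 200, 2000, 5

W_in = rng.uniform(-0.5, 0.5, n_res)
W = rng.normal(0, 1, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))     # spectral radius < 1 (fading memory)

u = rng.uniform(-1, 1, T)
x = np.zeros(n_res)
states = np.zeros((T, n_res))
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])
    states[t] = x

target = np.roll(u, delay)                           # task: recall u(t - delay)
washout = 100                                        # discard initial transient
X, y = states[washout:], target[washout:]
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)

pred = X @ W_out
print("recall correlation:", np.corrcoef(pred, y)[0, 1])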
Autonomic Closure for Turbulent Flows Using Approximate Bayesian Computation
NASA Astrophysics Data System (ADS)
Doronina, Olga; Christopher, Jason; Hamlington, Peter; Dahm, Werner
2017-11-01
Autonomic closure is a new technique for achieving fully adaptive and physically accurate closure of coarse-grained turbulent flow governing equations, such as those solved in large eddy simulations (LES). Although autonomic closure has been shown in recent a priori tests to more accurately represent unclosed terms than do dynamic versions of traditional LES models, the computational cost of the approach makes it challenging to implement for simulations of practical turbulent flows at realistically high Reynolds numbers. The optimization step used in the approach introduces large matrices that must be inverted and is highly memory intensive. In order to reduce memory requirements, here we propose to use approximate Bayesian computation (ABC) in place of the optimization step, thereby yielding a computationally-efficient implementation of autonomic closure that trades memory-intensive for processor-intensive computations. The latter challenge can be overcome as co-processors such as general purpose graphical processing units become increasingly available on current generation petascale and exascale supercomputers. In this work, we outline the formulation of ABC-enabled autonomic closure and present initial results demonstrating the accuracy and computational cost of the approach.
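For concreteness, a toy ABC rejection sampler in Python (with an invented Gaussian forward model standing in for the coarse-grained flow model): candidate parameters are accepted when simulated summary statistics land within a tolerance of the observed ones, with no likelihood evaluation and no large matrix inversion. The prior, tolerance, and summary statistic are illustrative assumptions.

# Minimal approximate Bayesian computation (ABC) rejection sampler.
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, n=500):
    # toy forward model standing in for the coarse-grained simulation
    return rng.normal(theta, 1.0, n)

observed = simulate(2.0)
obs_stat = observed.mean()

def abc_rejection(n_samples=20000, tol=0.05):
    accepted = []
    for _ in range(n_samples):
        theta = rng.uniform(-5, 5)          # prior draw
        sim_stat = simulate(theta).mean()   # summary statistic of the simulation
        if abs(sim_stat - obs_stat) < tol:  # keep only close matches
            accepted.append(theta)
    return np.array(accepted)

posterior = abc_rejection()
print("posterior mean ~", posterior.mean(), "from", posterior.size, "accepted draws")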
Efficient implementation of a 3-dimensional ADI method on the iPSC/860
DOE Office of Scientific and Technical Information (OSTI.GOV)
Van der Wijngaart, R.F.
1993-12-31
A comparison is made between several domain decomposition strategies for the solution of three-dimensional partial differential equations on a MIMD distributed memory parallel computer. The grids used are structured, and the numerical algorithm is ADI. Important implementation issues regarding load balancing, storage requirements, network latency, and overlap of computations and communications are discussed. Results of the solution of the three-dimensional heat equation on the Intel iPSC/860 are presented for the three most viable methods. It is found that the Bruno-Cappello decomposition delivers optimal computational speed through an almost complete elimination of processor idle time, while providing good memory efficiency.
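A single-process sketch of the core ADI kernel: each half-step solves a tridiagonal system per grid line with the Thomas algorithm, which is the operation a domain-decomposed implementation must organize across processors. The grid size, zero Dirichlet boundaries, and time-step parameter below are illustrative assumptions, not the paper's configuration.

# One ADI half-step for the 2D heat equation: implicit along x, explicit along y.
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-, main-, super-diagonals a, b, c."""
    n = len(d)
    cp, dp = np.zeros(n), np.zeros(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def adi_half_step(u, alpha):
    """Solve (I - alpha*Dxx) u* = (I + alpha*Dyy) u, line by line."""
    nx, ny = u.shape
    a = np.full(nx, -alpha); a[0] = 0.0
    c = np.full(nx, -alpha); c[-1] = 0.0
    b = np.full(nx, 1.0 + 2.0 * alpha)
    up = np.pad(u, 1)                         # zero (Dirichlet) boundaries
    out = np.zeros_like(u)
    for j in range(ny):
        jj = j + 1
        rhs = u[:, j] + alpha * (up[1:-1, jj + 1] - 2 * u[:, j] + up[1:-1, jj - 1])
        out[:, j] = thomas(a, b, c, rhs)
    return out

u = np.zeros((64, 64)); u[32, 32] = 1.0      # point heat source
u = adi_half_step(u, alpha=0.25)
print(u.sum())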
Evolution of a standard microprocessor-based space computer
NASA Technical Reports Server (NTRS)
Fernandez, M.
1980-01-01
An existing in inventory computer hardware/software package (B-1 RFS/ECM) was repackaged and applied to multiple missile/space programs. Concurrent with the application efforts, low risk modifications were made to the computer from program to program to take advantage of newer, advanced technology and to meet increasingly more demanding requirements (computational and memory capabilities, longer life, and fault tolerant autonomy). It is concluded that microprocessors hold promise in a number of critical areas for future space computer applications. However, the benefits of the DoD VHSIC Program are required and the old proliferation problem must be revised.
NMDA Receptors Are Not Required for Pattern Completion During Associative Memory Recall
Gu, Yiran; Cui, Zhenzhong; Tsien, Joe Z.
2011-01-01
Pattern completion, the ability to retrieve complete memories initiated by subsets of external cues, has been a major focus of many computational models. A previous study reported that such pattern completion requires NMDA receptors in the hippocampus. However, such a claim was derived from a non-inducible gene knockout experiment in which the NMDA receptors were absent throughout all stages of memory processing as well as the animal's adult life. This raises the critical question of whether the previously described results truly reflect a requirement for NMDA receptors in retrieval. Here, we have examined the role of the NMDA receptors in pattern completion via inducible knockout of NMDA receptors limited to the memory retrieval stage. By using two independent mouse lines, we found that inducible knockout mice, lacking NMDA receptors in either the forebrain or the hippocampal CA1 region at the time of memory retrieval, exhibited normal recall of associative spatial reference memory regardless of whether retrieval took place under full-cue or partial-cue conditions. Moreover, systemic antagonism of NMDA receptors during retention tests also had no effect on full-cue or partial-cue recall of spatial water maze memories. Thus, both genetic and pharmacological experiments collectively demonstrate that pattern completion during spatial associative memory recall does not require the NMDA receptor in the hippocampus or forebrain. PMID:21559402
Operate a Nuclear Power Plant.
ERIC Educational Resources Information Center
Frimpter, Bonnie J.; And Others
1983-01-01
Describes classroom use of a computer program originally published in Creative Computing magazine. "The Nuclear Power Plant" (runs on Apple II with 48K memory) simulates the operating of a nuclear generating station, requiring students to make decisions as they assume the task of managing the plant. (JN)
Parallel Computing for Probabilistic Response Analysis of High Temperature Composites
NASA Technical Reports Server (NTRS)
Sues, R. H.; Lua, Y. J.; Smith, M. D.
1994-01-01
The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit these parallelisms. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.
Design of an online EEG based neurofeedback game for enhancing attention and memory.
Thomas, Kavitha P; Vinod, A P; Guan, Cuntai
2013-01-01
Brain-Computer Interface (BCI) is an alternative communication and control channel between the brain and a computer, which finds applications in neuroprosthetics, brain-wave-controlled computer games, etc. This paper proposes an Electroencephalogram (EEG) based neurofeedback computer game that allows the player to control the game with the help of attention-based brain signals. The proposed game protocol requires the player to memorize a set of numbers in a matrix, and to correctly fill the matrix using his or her attention. The attention level of the player is quantified using sample entropy features of EEG. The statistically significant performance improvement of five healthy subjects after playing a number of game sessions demonstrates the effectiveness of the proposed game in enhancing their concentration and memory skills.
Online learning in optical tomography: a stochastic approach
NASA Astrophysics Data System (ADS)
Chen, Ke; Li, Qin; Liu, Jian-Guo
2018-07-01
We study the inverse problem of the radiative transfer equation (RTE) using the stochastic gradient descent method (SGD) in this paper. Mathematically, optical tomography amounts to recovering the optical parameters in the RTE using incoming–outgoing pairs of light intensity. We formulate it as a PDE-constrained optimization problem, where the mismatch between computed and measured outgoing data is minimized under the same initial data and the RTE constraint. The memory and computation cost this requires, however, is typically prohibitive, especially in high dimensional spaces. Smart iterative solvers that use only partial information in each step are therefore called for. Stochastic gradient descent is an online learning algorithm that randomly selects data for minimizing the mismatch. It requires minimal memory and computation, advances quickly, and therefore serves the purpose perfectly. In this paper we formulate the problem, in both the nonlinear and the linearized setting, apply the SGD algorithm, and analyze its convergence performance.
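A generic sketch of the online idea on a toy linear forward model (standing in for the RTE solve; the model, step-size schedule, and problem sizes are assumptions): each SGD step touches a single randomly chosen incoming-outgoing pair, so only that pair needs to be held in memory at any time.

# SGD on a toy least-squares mismatch: one random measurement per update.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear forward model G(sigma) = A @ sigma standing in for the RTE solve.
n_params, n_measurements = 50, 400
A = rng.normal(size=(n_measurements, n_params))
sigma_true = rng.normal(size=n_params)
data = A @ sigma_true + 0.01 * rng.normal(size=n_measurements)

sigma = np.zeros(n_params)
lr = 1e-3
for step in range(20000):
    i = rng.integers(n_measurements)           # random measurement pair
    residual = A[i] @ sigma - data[i]          # mismatch for that pair only
    sigma -= lr * residual * A[i]              # gradient of 0.5 * residual**2
    lr = 1e-3 / (1 + step / 5000)              # slowly decaying step size
print("relative error:", np.linalg.norm(sigma - sigma_true) / np.linalg.norm(sigma_true))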
A Bitslice Implementation of Anderson's Attack on A5/1
NASA Astrophysics Data System (ADS)
Bulavintsev, Vadim; Semenov, Alexander; Zaikin, Oleg; Kochemazov, Stepan
2018-03-01
The A5/1 keystream generator is a part of the Global System for Mobile Communications (GSM) protocol, employed in cellular networks all over the world. Its cryptographic resistance has been extensively analyzed in dozens of papers. However, almost all corresponding methods either employ specific hardware or require an extensive preprocessing stage and significant amounts of memory. In the present study, a bitslice variant of Anderson's Attack on A5/1 is implemented. It requires very little computer memory and no preprocessing. Moreover, the attack can be made even more efficient by harnessing the computing power of modern Graphics Processing Units (GPUs). As a result, using commonly available GPUs this method can quite efficiently recover the secret key using only 64 bits of keystream. To test the performance of the implementation, a volunteer computing project was launched. 10 instances of A5/1 cryptanalysis were successfully solved in this project in a single week.
NASA Technical Reports Server (NTRS)
Hou, Gene
1998-01-01
Sensitivity analysis is a technique for determining derivatives of system responses with respect to design parameters. Among the many methods available for sensitivity analysis, automatic differentiation has been proven through many applications in fluid dynamics and structural mechanics to be an accurate and easy method for obtaining derivatives. Nevertheless, the method can be computationally expensive and can require a large amount of memory. This project will apply an automatic differentiation tool, ADIFOR, to a p-version finite element code to obtain first- and second-order thermal derivatives, respectively. The focus of the study is on the implementation process and the performance of the ADIFOR-enhanced codes for sensitivity analysis in terms of memory requirement, computational efficiency, and accuracy.
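ADIFOR itself performs source-to-source transformation of Fortran, but the underlying forward-mode idea can be sketched in a few lines of Python with dual numbers that propagate a derivative alongside each value; the Dual class and the toy response function are illustrative assumptions, not the project's code.

# Minimal forward-mode automatic differentiation with dual numbers:
# every value carries its derivative with respect to a chosen parameter.
import math

class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

def response(p):
    # stand-in for a structural response computed by a finite element code
    return p * p * 3.0 + sin(p)

p = Dual(2.0, 1.0)               # seed: d(p)/d(p) = 1
r = response(p)
print(r.value, r.deriv)          # d/dp (3p^2 + sin p) = 6p + cos p = 12 + cos 2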
System, methods and apparatus for program optimization for multi-threaded processor architectures
Bastoul, Cedric; Lethin, Richard A; Leung, Allen K; Meister, Benoit J; Szilagyi, Peter; Vasilache, Nicolas T; Wohlford, David E
2015-01-06
Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for parallelism, locality of operations and contiguity of memory accesses on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
A facilitative effect of negative affective valence on working memory.
Gotoh, Fumiko; Kikuchi, Tadashi; Olofsson, Ulrich
2010-06-01
Previous studies have shown that negatively valenced information impaired working memory performance due to an attention-capturing effect. The present study examined whether negative valence could also facilitate working memory. Affective words (negative, neutral, positive) were used as retro-cues in a working memory task that required participants to remember colors at different spatial locations on a computer screen. Following the cue, a target detection task was used to either shift attention to a different location or keep attention at the same location as the retro-cue. Finally, participants were required to discriminate the cued color from a set of distractors. It was found that negative cues yielded shorter response times (RTs) in the attention-shift condition and longer RTs in the attention-stay condition, compared with neutral and positive cues. The results suggest that negative affective valence may enhance working memory performance (RTs), provided that attention can be disengaged.
NASA Technical Reports Server (NTRS)
Bell, L. D.; Boer, E.; Ostraat, M.; Brongersma, M. L.; Flagan, R. C.; Atwater, H. A.
2000-01-01
NASA requirements for computing and memory for microspacecraft emphasize high density, low power, small size, and radiation hardness. The distributed nature of storage elements in nanocrystal floating-gate memories leads to intrinsic fault tolerance and radiation hardness. Conventional floating-gate non-volatile memories are more susceptible to radiation damage. Nanocrystal-based memories also offer the possibility of faster, lower power operation. In the pursuit of filling these requirements, the following tasks have been accomplished: (1) Si nanocrystal charging has been accomplished with conducting-tip AFM; (2) Both individual nanocrystals on an oxide surface and nanocrystals formed by implantation have been charged; (3) Discharging is consistent with tunneling through a field-lowered oxide barrier; (4) Modeling of the response of the AFM to trapped charge has allowed estimation of the quantity of trapped charge; and (5) Initial attempts to fabricate competitive nanocrystal non-volatile memories have been extremely successful.
Advanced computer architecture specification for automated weld systems
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
This report describes the requirements for an advanced automated weld system and the associated computer architecture, and defines the overall system specification from a broad perspective. According to the requirements of welding procedures as they relate to an integrated multiaxis motion control and sensor architecture, the computer system requirements are developed based on a proven multiple-processor architecture with an expandable, distributed-memory, single global bus architecture, containing individual processors which are assigned to specific tasks that support sensor or control processes. The specified architecture is sufficiently flexible to integrate previously developed equipment, be upgradable and allow on-site modifications.
NASA Technical Reports Server (NTRS)
Hall, William A. (Inventor)
1993-01-01
A bus programmable slave module card for use in a computer control system is disclosed which comprises a master computer and one or more slave computer modules interfacing by means of a bus. Each slave module includes its own microprocessor, memory, and control program for acting as a single loop controller. The slave card includes a plurality of memory means (S1, S2...) corresponding to a like plurality of memory devices (C1, C2...) in the master computer, for each slave memory means its own communication lines connectable through the bus with memory communication lines of an associated memory device in the master computer, and a one-way electronic door which is switchable to either a closed condition or a one-way open condition. With the door closed, communication lines between master computer memory (C1, C2...) and slave memory (S1, S2...) are blocked. In the one-way open condition of the invention, the memory communication lines of each slave memory means (S1, S2...) connect with the memory communication lines of its associated memory device (C1, C2...) in the master computer, and the memory devices (C1, C2...) of the master computer and slave card are electrically parallel such that information seen by the master's memory is also seen by the slave's memory. The slave card is also connectable to a switch for electronically removing the slave microprocessor from the system. With the master computer and the slave card in programming mode relationship, and the slave microprocessor electronically removed from the system, loading a program into the memory devices (C1, C2...) of the master accomplishes a parallel loading into the memory devices (S1, S2...) of the slave.
PREPARING FOR EXASCALE: ORNL Leadership Computing Application Requirements and Strategy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joubert, Wayne; Kothe, Douglas B; Nam, Hai Ah
2009-12-01
In 2009 the Oak Ridge Leadership Computing Facility (OLCF), a U.S. Department of Energy (DOE) facility at the Oak Ridge National Laboratory (ORNL) National Center for Computational Sciences (NCCS), elicited petascale computational science requirements from leading computational scientists in the international science community. This effort targeted science teams whose projects received large computer allocation awards on OLCF systems. A clear finding of this process was that in order to reach their science goals over the next several years, multiple projects will require computational resources in excess of an order of magnitude more powerful than those currently available. Additionally, for the longer term, next-generation science will require computing platforms of exascale capability in order to reach DOE science objectives over the next decade. It is generally recognized that achieving exascale in the proposed time frame will require disruptive changes in computer hardware and software. Processor hardware will become necessarily heterogeneous and will include accelerator technologies. Software must undergo the concomitant changes needed to extract the available performance from this heterogeneous hardware. This disruption promises to be substantial, not unlike the change to the message passing paradigm in the computational science community over 20 years ago. Since technological disruptions take time to assimilate, we must aggressively embark on this course of change now, to ensure that science applications and their underlying programming models are mature and ready when exascale computing arrives. This includes initiation of application readiness efforts to adapt existing codes to heterogeneous architectures, support of relevant software tools, and procurement of next-generation hardware testbeds for porting and testing codes. The 2009 OLCF requirements process identified numerous actions necessary to meet this challenge: (1) Hardware capabilities must be advanced on multiple fronts, including peak flops, node memory capacity, interconnect latency, interconnect bandwidth, and memory bandwidth. (2) Effective parallel programming interfaces must be developed to exploit the power of emerging hardware. (3) Science application teams must now begin to adapt and reformulate application codes to the new hardware and software, typified by hierarchical and disparate layers of compute, memory and concurrency. (4) Algorithm research must be realigned to exploit this hierarchy. (5) When possible, mathematical libraries must be used to encapsulate the required operations in an efficient and useful way. (6) Software tools must be developed to make the new hardware more usable. (7) Science application software must be improved to cope with the increasing complexity of computing systems. (8) Data management efforts must be readied for the larger quantities of data generated by larger, more accurate science models. Requirements elicitation, analysis, validation, and management comprise a difficult and inexact process, particularly in periods of technological change. Nonetheless, the OLCF requirements modeling process is becoming increasingly quantitative and actionable, as the process becomes more developed and mature, and the process this year has identified clear and concrete steps to be taken.
This report discloses (1) the fundamental science case driving the need for the next generation of computer hardware, (2) application usage trends that illustrate the science need, (3) application performance characteristics that drive the need for increased hardware capabilities, (4) resource and process requirements that make the development and deployment of science applications on next-generation hardware successful, and (5) summary recommendations for the required next steps within the computer and computational science communities.
FPGA cluster for high-performance AO real-time control system
NASA Astrophysics Data System (ADS)
Geng, Deli; Goodsell, Stephen J.; Basden, Alastair G.; Dipper, Nigel A.; Myers, Richard M.; Saunter, Chris D.
2006-06-01
Whilst the high throughput and low latency requirements for next-generation AO real-time control systems have posed a significant challenge to von Neumann architecture processor systems, the Field Programmable Gate Array (FPGA) has emerged as a long term solution with high performance on throughput and excellent predictability on latency. Moreover, FPGA devices have highly capable programmable interfacing, which leads to a more highly integrated system. Nevertheless, a single FPGA is still not enough: multiple FPGA devices need to be clustered to perform the required subaperture processing and the reconstruction computation. In an AO real-time control system, the memory bandwidth is often the bottleneck of the system, simply because a vast amount of supporting data, e.g. pixel calibration maps and the reconstruction matrix, needs to be accessed within a short period. The cluster, as a general computing architecture, has excellent scalability in processing throughput, memory bandwidth, memory capacity, and communication bandwidth. Problems such as task distribution, node communication, and system verification are discussed.
Protecting solid-state spins from a strongly coupled environment
NASA Astrophysics Data System (ADS)
Chen, Mo; Calvin Sun, Won Kyu; Saha, Kasturi; Jaskula, Jean-Christophe; Cappellaro, Paola
2018-06-01
Quantum memories are critical for solid-state quantum computing devices and a good quantum memory requires both long storage time and fast read/write operations. A promising system is the nitrogen-vacancy (NV) center in diamond, where the NV electronic spin serves as the computing qubit and a nearby nuclear spin as the memory qubit. Previous works used remote, weakly coupled 13C nuclear spins, trading read/write speed for long storage time. Here we focus instead on the intrinsic strongly coupled 14N nuclear spin. We first quantitatively understand its decoherence mechanism, identifying as its source the electronic spin that acts as a quantum fluctuator. We then propose a scheme to protect the quantum memory from the fluctuating noise by applying dynamical decoupling on the environment itself. We demonstrate a factor of 3 enhancement of the storage time in a proof-of-principle experiment, showing the potential for a quantum memory that combines fast operation with long coherence time.
Cabral, Henrique O; Vinck, Martin; Fouquet, Celine; Pennartz, Cyriel M A; Rondi-Reig, Laure; Battaglia, Francesco P
2014-01-22
Place coding in the hippocampus requires flexible combination of sensory inputs (e.g., environmental and self-motion information) with memory of past events. We show that mouse CA1 hippocampal spatial representations may either be anchored to external landmarks (place memory) or reflect memorized sequences of cell assemblies depending on the behavioral strategy spontaneously selected. These computational modalities correspond to different CA1 dynamical states, as expressed by theta and low- and high-frequency gamma oscillations, when switching from place to sequence memory-based processing. These changes are consistent with a shift from entorhinal to CA3 input dominance on CA1. In mice with a deletion of forebrain NMDA receptors, the ability of place cells to maintain a map based on sequence memory is selectively impaired and oscillatory dynamics are correspondingly altered, suggesting that oscillations contribute to selecting behaviorally appropriate computations in the hippocampus and that NMDA receptors are crucial for this function. Copyright © 2014 Elsevier Inc. All rights reserved.
Contrasting single and multi-component working-memory systems in dual tasking.
Nijboer, Menno; Borst, Jelmer; van Rijn, Hedderik; Taatgen, Niels
2016-05-01
Working memory can be a major source of interference in dual tasking. However, there is no consensus on whether this interference is the result of a single working memory bottleneck, or of interactions between different working memory components that together form a complete working-memory system. We report a behavioral and an fMRI dataset in which working memory requirements are manipulated during multitasking. We show that a computational cognitive model that assumes a distributed version of working memory accounts for both behavioral and neuroimaging data better than a model that takes a more centralized approach. The model's working memory consists of an attentional focus, declarative memory, and a subvocalized rehearsal mechanism. Thus, the data and model favor an account where working memory interference in dual tasking is the result of interactions between different resources that together form a working-memory system. Copyright © 2016 Elsevier Inc. All rights reserved.
Advanced information processing system: Local system services
NASA Technical Reports Server (NTRS)
Burkhardt, Laura; Alger, Linda; Whittredge, Roy; Stasiowski, Peter
1989-01-01
The Advanced Information Processing System (AIPS) is a multi-computer architecture composed of hardware and software building blocks that can be configured to meet a broad range of application requirements. The hardware building blocks are fault-tolerant, general-purpose computers, fault-and damage-tolerant networks (both computer and input/output), and interfaces between the networks and the computers. The software building blocks are the major software functions: local system services, input/output, system services, inter-computer system services, and the system manager. The foundation of the local system services is an operating system with the functions required for a traditional real-time multi-tasking computer, such as task scheduling, inter-task communication, memory management, interrupt handling, and time maintenance. Resting on this foundation are the redundancy management functions necessary in a redundant computer and the status reporting functions required for an operator interface. The functional requirements, functional design and detailed specifications for all the local system services are documented.
Low cost computer subsystem for the Solar Electric Propulsion Stage (SEPS)
NASA Technical Reports Server (NTRS)
1975-01-01
The Solar Electric Propulsion Stage (SEPS) subsystem, which consists of the computer, a custom input/output (I/O) unit, and a tape recorder for mass storage of telemetry data, was studied. Computer software and interface requirements were developed along with computer and I/O unit design parameters. Redundancy implementation was emphasized. A reliability analysis was performed for the complete command computer subsystem. A SEPS fault tolerant memory breadboard was constructed and its operation demonstrated.
Multiplexed memory-insensitive quantum repeaters.
Collins, O A; Jenkins, S D; Kuzmich, A; Kennedy, T A B
2007-02-09
Long-distance quantum communication via distant pairs of entangled quantum bits (qubits) is the first step towards secure message transmission and distributed quantum computing. To date, the most promising proposals require quantum repeaters to mitigate the exponential decrease in communication rate due to optical fiber losses. However, these are exquisitely sensitive to the lifetimes of their memory elements. We propose a multiplexing of quantum nodes that should enable the construction of quantum networks that are largely insensitive to the coherence times of the quantum memory elements.
A Comprehensive Study on Energy Efficiency and Performance of Flash-based SSD
DOE Office of Scientific and Technical Information (OSTI.GOV)
Park, Seon-Yeon; Kim, Youngjae; Urgaonkar, Bhuvan
2011-01-01
Use of flash memory as a storage medium is becoming popular in diverse computing environments. However, because of differences in interface, flash memory requires a hard-disk-emulation layer, called the FTL (flash translation layer). Although the FTL enables flash memory storages to replace conventional hard disks, it induces significant computational and space overhead. Despite the low power consumption of flash memory, this overhead leads to significant power consumption in the overall storage system. In this paper, we analyze the characteristics of flash-based storage devices from the viewpoint of power consumption and energy efficiency by using various methodologies. First, we utilize simulation to investigate the internal operation of flash-based storages. Subsequently, we measure the performance and energy efficiency of commodity flash-based SSDs by using microbenchmarks to identify the block-device level characteristics and macrobenchmarks to reveal their filesystem level characteristics.
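To make the FTL overhead concrete, here is a toy page-mapping flash translation layer in Python: every rewrite of a logical page is redirected to a fresh physical page and the old copy is marked stale for garbage collection. The class and its policies are invented for illustration and are far simpler than a real FTL.

# Toy page-mapping FTL: logical pages are remapped on every write because
# flash cannot be overwritten in place.
class SimpleFTL:
    def __init__(self, n_physical_pages):
        self.free = list(range(n_physical_pages))   # free physical pages
        self.l2p = {}                                # logical -> physical map
        self.invalid = set()                         # stale pages awaiting GC

    def write(self, logical_page, data, flash):
        if logical_page in self.l2p:                 # old copy becomes stale
            self.invalid.add(self.l2p[logical_page])
        if not self.free:
            raise RuntimeError("no free pages: garbage collection needed")
        phys = self.free.pop()
        flash[phys] = data                           # program a fresh page
        self.l2p[logical_page] = phys

    def read(self, logical_page, flash):
        return flash[self.l2p[logical_page]]

flash = [None] * 8
ftl = SimpleFTL(len(flash))
ftl.write(0, "A", flash)
ftl.write(0, "B", flash)                             # rewrite: new physical page
print(ftl.read(0, flash), len(ftl.invalid), "stale page(s)")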
40 CFR 1033.110 - Emission diagnostics-general requirements.
Code of Federal Regulations, 2010 CFR
2010-07-01
...engine operation. (d) Record and store in computer memory any diagnostic trouble codes showing a...
Method and apparatus for managing access to a memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
DeBenedictis, Erik
A method and apparatus for managing access to a memory of a computing system. A controller transforms a plurality of operations that represent a computing job into an operational memory layout that reduces the size of a selected portion of the memory that needs to be accessed to perform the computing job. The controller stores the operational memory layout in a plurality of memory cells within the selected portion of the memory. The controller controls a sequence by which a processor in the computing system accesses the memory to perform the computing job using the operational memory layout. The operational memory layout reduces the amount of energy consumed by the processor to perform the computing job.
NASA Astrophysics Data System (ADS)
Wei, Hui; Gong, Guanghong; Li, Ni
2017-10-01
Computer-generated hologram (CGH) is a promising 3D display technology, but it is challenged by a heavy computation load and vast memory requirements. To solve these problems, a depth-compensating CGH calculation method based on the symmetry and similarity of zone plates is proposed and implemented on a graphics processing unit (GPU). An improved LUT method is put forward to compute the distances between object points and hologram pixels in the XY direction. The concept of a depth-compensating factor is defined and used to calculate the holograms of points at different depth positions, instead of resorting to layer-based methods. The proposed method is suitable for arbitrarily sampled objects and offers lower memory usage and higher computational efficiency compared to other CGH methods. The effectiveness of the proposed method is validated by numerical and optical experiments.
Kramer, Tobias; Noack, Matthias; Reinefeld, Alexander; Rodríguez, Mirta; Zelinskyy, Yaroslav
2018-06-11
Time- and frequency-resolved optical signals provide insights into the properties of light-harvesting molecular complexes, including excitation energies, dipole strengths and orientations, as well as in the exciton energy flow through the complex. The hierarchical equations of motion (HEOM) provide a unifying theory, which allows one to study the combined effects of system-environment dissipation and non-Markovian memory without making restrictive assumptions about weak or strong couplings or separability of vibrational and electronic degrees of freedom. With increasing system size the exact solution of the open quantum system dynamics requires memory and compute resources beyond a single compute node. To overcome this barrier, we developed a scalable variant of HEOM. Our distributed memory HEOM, DM-HEOM, is a universal tool for open quantum system dynamics. It is used to accurately compute all experimentally accessible time- and frequency-resolved processes in light-harvesting molecular complexes with arbitrary system-environment couplings for a wide range of temperatures and complex sizes. © 2018 Wiley Periodicals, Inc.
Moradi, Saber; Qiao, Ning; Stefanini, Fabio; Indiveri, Giacomo
2018-02-01
Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here, we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multicore neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.
Equation solvers for distributed-memory computers
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.
1994-01-01
A large number of scientific and engineering problems require the rapid solution of large systems of simultaneous equations. The performance of parallel computers in this area now dwarfs traditional vector computers by nearly an order of magnitude. This talk describes the major issues involved in parallel equation solvers with particular emphasis on the Intel Paragon, IBM SP-1 and SP-2 processors.
Software beamforming: comparison between a phased array and synthetic transmit aperture.
Li, Yen-Feng; Li, Pai-Chi
2011-04-01
The data-transfer and computation requirements are compared between software-based beamforming using a phased array (PA) and a synthetic transmit aperture (STA). The advantages of a software-based architecture are reduced system complexity and lower hardware cost. Although this architecture can be implemented using commercial CPUs or GPUs, the high computation and data-transfer requirements limit its real-time beamforming performance. In particular, transferring the raw rf data from the front-end subsystem to the software back-end remains challenging with current state-of-the-art electronics technologies, which offsets the cost advantage of the software back end. This study investigated the tradeoff between the data-transfer and computation requirements. Two beamforming methods based on a PA and an STA, respectively, were used: the former requires a higher data transfer rate and the latter requires more memory operations. The beamformers were implemented in an NVIDIA GeForce GTX 260 GPU and an Intel Core i7 920 CPU. The frame rate of PA beamforming was 42 fps with a 128-element array transducer, with 2048 samples per firing and 189 beams per image (with a 95 MB/frame data-transfer requirement). The frame rate of STA beamforming was 40 fps with 16 firings per image (with an 8 MB/frame data-transfer requirement). Both approaches achieved real-time beamforming performance but each had its own bottleneck: the required data-transfer speed was considerably reduced in STA beamforming, but this approach required more memory operations, which limited the overall computation time. The advantages of the GPU approach over the CPU approach were clearly demonstrated.
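The kernel both back ends must execute is plain delay-and-sum beamforming; the Python sketch below forms one scan line from synthetic (all-zero) channel data to show where the per-sample memory accesses come from. The array geometry, sampling rate, and delay model are generic assumptions, not the paper's configuration.

# Plain delay-and-sum receive beamforming for one scan line.
import numpy as np

c = 1540.0                 # speed of sound, m/s
fs = 40e6                  # sampling rate, Hz
n_elements = 128
pitch = 0.3e-3             # element spacing, m
element_x = (np.arange(n_elements) - (n_elements - 1) / 2) * pitch

# rf[e, t]: raw channel data for one transmit event (synthetic zeros here)
n_samples = 2048
rf = np.zeros((n_elements, n_samples))

def beamform_line(rf, depths_m, line_x=0.0):
    """Sum channel samples after applying geometric receive delays."""
    out = np.zeros(len(depths_m))
    for k, z in enumerate(depths_m):
        # two-way path: transmit to (line_x, z), receive back to each element
        t_tx = z / c
        t_rx = np.sqrt((element_x - line_x) ** 2 + z ** 2) / c
        idx = np.round((t_tx + t_rx) * fs).astype(int)
        valid = idx < rf.shape[1]
        out[k] = rf[np.where(valid)[0], idx[valid]].sum()
    return out

depths = np.linspace(5e-3, 40e-3, 512)
line = beamform_line(rf, depths)
print(line.shape)          # one beamformed scan line of 512 samples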
MemAxes: Visualization and Analytics for Characterizing Complex Memory Performance Behaviors.
Gimenez, Alfredo; Gamblin, Todd; Jusufi, Ilir; Bhatele, Abhinav; Schulz, Martin; Bremer, Peer-Timo; Hamann, Bernd
2018-07-01
Memory performance is often a major bottleneck for high-performance computing (HPC) applications. Deepening memory hierarchies, complex memory management, and non-uniform access times have made memory performance behavior difficult to characterize, and users require novel, sophisticated tools to analyze and optimize this aspect of their codes. Existing tools target only specific factors of memory performance, such as hardware layout, allocations, or access instructions. However, today's tools do not suffice to characterize the complex relationships between these factors. Further, they require advanced expertise to be used effectively. We present MemAxes, a tool based on a novel approach for analytic-driven visualization of memory performance data. MemAxes uniquely allows users to analyze the different aspects related to memory performance by providing multiple visual contexts for a centralized dataset. We define mappings of sampled memory access data to new and existing visual metaphors, each of which enables a user to perform different analysis tasks. We present methods to guide user interaction by scoring subsets of the data based on known performance problems. This scoring is used to provide visual cues and automatically extract clusters of interest. We designed MemAxes in collaboration with experts in HPC and demonstrate its effectiveness in case studies.
Adaptive efficient compression of genomes
2012-01-01
Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. However, memory requirements of the current algorithms are high and run times are often slow. In this paper, we propose an adaptive, parallel and highly efficient referential sequence compression method which allows fine-tuning of the trade-off between required memory and compression speed. When using 12 MB of memory, our method is on par with the best previous algorithms for human genomes in terms of compression ratio (400:1) and compression speed. In contrast, it compresses a complete human genome in just 11 seconds when provided with 9 GB of main memory, which is almost three times faster than the best competitor while using less main memory. PMID:23146997
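A toy referential encoder in Python illustrates the general idea of storing matches against a reference plus literals for the differences; the greedy matching, the parameter k, and the example sequences are invented for illustration and bear no relation to the paper's actual data structures.

# Tiny referential-compression sketch: the target is stored as (position,
# length) matches against a reference plus literal bases for mismatches.
def compress(target, reference, k=8):
    ops, i = [], 0
    while i < len(target):
        chunk = target[i:i + k]
        pos = reference.find(chunk) if len(chunk) == k else -1
        if pos >= 0:
            length = k
            # greedily extend the match as far as both sequences agree
            while (i + length < len(target) and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            ops.append(("match", pos, length))
            i += length
        else:
            ops.append(("literal", target[i]))
            i += 1
    return ops

def decompress(ops, reference):
    out = []
    for op in ops:
        if op[0] == "match":
            _, pos, length = op
            out.append(reference[pos:pos + length])
        else:
            out.append(op[1])
    return "".join(out)

reference = "ACGTACGTGGTTACGTAACCGGTT"
target    = "ACGTACGTGGTAACGTAACCGGTA"      # two point differences
ops = compress(target, reference)
assert decompress(ops, reference) == target
print(len(ops), "operations instead of", len(target), "bases")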
Monitoring item and source information: evidence for a negative generation effect in source memory.
Jurica, P J; Shimamura, A P
1999-07-01
Item memory and source memory were assessed in a task that simulated a social conversation. Participants generated answers to questions or read statements presented by one of three sources (faces on a computer screen). Positive generation effects were observed for item memory. That is, participants remembered topics of conversation better if they were asked questions about the topics than if they simply read statements about the topics. However, a negative generation effect occurred for source memory. That is, remembering the source of some information was disrupted if participants were required to answer questions pertaining to that information. These findings support the notion that item and source memory are mediated, at least in part, by different processes during encoding.
Topological computation based on direct magnetic logic communication.
Zhang, Shilei; Baker, Alexander A; Komineas, Stavros; Hesjedal, Thorsten
2015-10-28
Non-uniform magnetic domains with non-trivial topology, such as vortices and skyrmions, are proposed as superior state variables for nonvolatile information storage. So far, the possibility of logic operations using topological objects has not been considered. Here, we demonstrate numerically that the topology of the system plays a significant role for its dynamics, using the example of vortex-antivortex pairs in a planar ferromagnetic film. Utilising the dynamical properties and geometrical confinement, direct logic communication between the topological memory carriers is realised. This way, no additional magnetic-to-electrical conversion is required. More importantly, the information carriers can spontaneously travel up to ~300 nm, for which no spin-polarised current is required. The derived logic scheme enables topological spintronics, which can be integrated into large-scale memory and logic networks capable of complex computations.
NASA Astrophysics Data System (ADS)
Collins, Robert J.; Donaldon, Ross J.; Dunjko, Vedran; Wallden, Petros; Clarke, Patrick J.; Andersson, Erika; Jeffers, John; Buller, Gerald S.
2014-10-01
Classical digital signatures are commonly used in e-mail, electronic financial transactions and other forms of electronic communications to ensure that messages have not been tampered with in transit, and that messages are transferrable. The security of commonly used classical digital signature schemes relies on the computational difficulty of inverting certain mathematical functions. However, at present, there are no such one-way functions which have been proven to be hard to invert. With enough computational resources certain implementations of classical public key cryptosystems can be, and have been, broken with current technology. It is nevertheless possible to construct information-theoretically secure signature schemes, including quantum digital signature schemes. Quantum signature schemes can be made information-theoretically secure based on the laws of quantum mechanics, while comparable classical protocols require additional resources such as secret communication and a trusted authority. Early demonstrations of quantum digital signatures required quantum memory, rendering them impractical at present. Our present implementation is based on a protocol that does not require quantum memory. It also uses the new technique of unambiguous quantum state elimination. Here we report experimental results for a test-bed system, recorded with a variety of different operating parameters, along with a discussion of aspects of the system security.
The Science of Computing: Virtual Memory
NASA Technical Reports Server (NTRS)
Denning, Peter J.
1986-01-01
In the March-April issue, I described how a computer's storage system is organized as a hierarchy consisting of cache, main memory, and secondary memory (e.g., disk). The cache and main memory form a subsystem that functions like main memory but attains speeds approaching cache. What happens if a program and its data are too large for the main memory? This is not a frivolous question. Every generation of computer users has been frustrated by insufficient memory. A new line of computers may have sufficient storage for the computations of its predecessor, but new programs will soon exhaust its capacity. In 1960, a long-range planning committee at MIT dared to dream of a computer with 1 million words of main memory. In 1985, the Cray-2 was delivered with 256 million words. Computational physicists dream of computers with 1 billion words. Computer architects have done an outstanding job of enlarging main memories, yet they have never kept up with demand. Only the shortsighted believe they can.
Reducing the computational footprint for real-time BCPNN learning
Vogginger, Bernhard; Schüffny, René; Lansner, Anders; Cederström, Love; Partzsch, Johannes; Höppner, Sebastian
2015-01-01
The implementation of synaptic plasticity in neural simulation or neuromorphic hardware is usually very resource-intensive, often requiring a compromise between efficiency and flexibility. A versatile, but computationally-expensive plasticity mechanism is provided by the Bayesian Confidence Propagation Neural Network (BCPNN) paradigm. Building upon Bayesian statistics, and having clear links to biological plasticity processes, the BCPNN learning rule has been applied in many fields, ranging from data classification, associative memory, reward-based learning, probabilistic inference to cortical attractor memory networks. In the spike-based version of this learning rule the pre-, postsynaptic and coincident activity is traced in three low-pass-filtering stages, requiring a total of eight state variables, whose dynamics are typically simulated with the fixed step size Euler method. We derive analytic solutions allowing an efficient event-driven implementation of this learning rule. Further speedup is achieved by first rewriting the model which reduces the number of basic arithmetic operations per update to one half, and second by using look-up tables for the frequently calculated exponential decay. Ultimately, in a typical use case, the simulation using our approach is more than one order of magnitude faster than with the fixed step size Euler method. Aiming for a small memory footprint per BCPNN synapse, we also evaluate the use of fixed-point numbers for the state variables, and assess the number of bits required to achieve same or better accuracy than with the conventional explicit Euler method. All of this will allow a real-time simulation of a reduced cortex model based on BCPNN in high performance computing. More important, with the analytic solution at hand and due to the reduced memory bandwidth, the learning rule can be efficiently implemented in dedicated or existing digital neuromorphic hardware. PMID:25657618
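To make the event-driven idea concrete, the sketch below advances a single low-pass-filtered trace analytically between spikes instead of stepping it with fixed-step Euler, and reads the exponential decay factor from a precomputed table. It is a minimal Python sketch under assumed parameter values; the names TAU, DT_LUT and EventDrivenTrace are illustrative, and the authors' BCPNN rule tracks eight coupled state variables per synapse rather than the single trace shown here.

```python
import numpy as np

TAU = 0.020          # trace time constant in seconds (illustrative value)
DT_LUT = 1e-4        # lookup-table resolution for the decay factor
LUT_SIZE = 100_000   # covers delays up to LUT_SIZE * DT_LUT seconds

# Precompute exp(-dt/tau) for discretized delays, standing in for the
# paper's lookup table of frequently needed exponential decays.
DECAY_LUT = np.exp(-np.arange(LUT_SIZE) * DT_LUT / TAU)

def decay(z, dt):
    """Analytic decay of a trace z over a delay dt, via the lookup table."""
    idx = min(int(round(dt / DT_LUT)), LUT_SIZE - 1)
    return z * DECAY_LUT[idx]

class EventDrivenTrace:
    """One low-pass-filtered trace, updated only when a spike arrives."""
    def __init__(self):
        self.z = 0.0
        self.t_last = 0.0

    def on_spike(self, t, weight=1.0):
        # Jump from the last event directly to time t (no fixed-step loop),
        # then add the new spike's contribution.
        self.z = decay(self.z, t - self.t_last) + weight
        self.t_last = t
        return self.z

# Usage: spikes at irregular times; the trace is touched only at events.
trace = EventDrivenTrace()
for t in (0.001, 0.004, 0.030):
    print(t, trace.on_spike(t))
```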
Wilhelm, Jan; Seewald, Patrick; Del Ben, Mauro; Hutter, Jürg
2016-12-13
We present an algorithm for computing the correlation energy in the random phase approximation (RPA) in a Gaussian basis requiring O(N^3) operations and O(N^2) memory. The method is based on the resolution of the identity (RI) with the overlap metric, a reformulation of RI-RPA in the Gaussian basis, imaginary time, and imaginary frequency integration techniques, and the use of sparse linear algebra. Additional memory reduction without extra computations can be achieved by an iterative scheme that overcomes the memory bottleneck of canonical RPA implementations. We report a massively parallel implementation that is the key for the application to large systems. Finally, cubic-scaling RPA is applied to a thousand water molecules using a correlation-consistent triple-ζ quality basis.
Compact modeling of CRS devices based on ECM cells for memory, logic and neuromorphic applications.
Linn, E; Menzel, S; Ferch, S; Waser, R
2013-09-27
Dynamic physics-based models of resistive switching devices are of great interest for the realization of complex circuits required for memory, logic and neuromorphic applications. Here, we apply such a model of an electrochemical metallization (ECM) cell to complementary resistive switches (CRSs), which are favorable devices to realize ultra-dense passive crossbar arrays. Since a CRS consists of two resistive switching devices, it is straightforward to apply the dynamic ECM model for CRS simulation with MATLAB and SPICE, enabling study of the device behavior in terms of sweep rate and series resistance variations. Furthermore, typical memory access operations as well as basic implication logic operations can be analyzed, revealing requirements for proper spike and level read operations. This basic understanding facilitates applications of massively parallel computing paradigms required for neuromorphic applications.
ERIC Educational Resources Information Center
Brusco, Michael J.; Kohn, Hans-Friedrich; Stahl, Stephanie
2008-01-01
Dynamic programming methods for matrix permutation problems in combinatorial data analysis can produce globally-optimal solutions for matrices up to size 30x30, but are computationally infeasible for larger matrices because of enormous computer memory requirements. Branch-and-bound methods also guarantee globally-optimal solutions, but computation…
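The memory blow-up mentioned in this abstract comes from storing one value per subset of objects. The hedged sketch below shows a generic subset dynamic program for one common matrix permutation objective (the linear ordering problem: maximize the sum of above-diagonal entries after permutation). It is only an illustration of why exact DP stalls near size 30, since the table has 2^n entries (roughly 8 GB of doubles at n = 30), and is not the specific algorithm of the cited authors.

```python
import numpy as np

def best_permutation(A):
    """Subset DP for the linear ordering problem: find the row/column
    permutation maximizing the sum of above-diagonal entries.
    Memory is O(2^n), which is what limits exact DP to n around 30."""
    n = A.shape[0]
    f = np.full(1 << n, -np.inf)     # best objective for each prefix set
    back = np.zeros(1 << n, dtype=int)
    f[0] = 0.0
    for S in range(1 << n):
        if f[S] == -np.inf:
            continue
        placed = [i for i in range(n) if S >> i & 1]
        for j in range(n):
            if S >> j & 1:
                continue
            gain = sum(A[i, j] for i in placed)   # rows already placed point to j
            T = S | (1 << j)
            if f[S] + gain > f[T]:
                f[T], back[T] = f[S] + gain, j
    # Recover the optimal order by walking the backpointers.
    order, S = [], (1 << n) - 1
    while S:
        j = back[S]
        order.append(j)
        S ^= 1 << j
    return f[-1], order[::-1]

A = np.random.rand(10, 10)
value, order = best_permutation(A)
print(value, order)
```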
Proteinortho: detection of (co-)orthologs in large-scale analysis.
Lechner, Marcus; Findeiss, Sven; Steiner, Lydia; Marz, Manja; Stadler, Peter F; Prohaska, Sonja J
2011-04-28
Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practice. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases. The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes. Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.
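For readers unfamiliar with the reciprocal best alignment idea that Proteinortho extends, the following minimal Python sketch pairs genes from two genomes when each is the other's highest-scoring hit; the sparse dictionary-of-hits representation is what avoids the quadratic all-vs-all matrix mentioned above. The gene names and scores are made up, and Proteinortho itself adds adaptive score thresholds and further graph-based clustering on top of this core step.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Pairs (a, b) where b is a's best hit and a is b's best hit.
    scores_ab: dict mapping gene a -> {gene b: alignment score}; scores_ba is the reverse."""
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]

# Hypothetical hit tables for two small genomes.
scores_ab = {"geneA1": {"geneB1": 320.0, "geneB2": 55.0},
             "geneA2": {"geneB2": 210.0}}
scores_ba = {"geneB1": {"geneA1": 318.0},
             "geneB2": {"geneA2": 205.0, "geneA1": 60.0}}
print(reciprocal_best_hits(scores_ab, scores_ba))  # [('geneA1', 'geneB1'), ('geneA2', 'geneB2')]
```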
Genotype Imputation with Millions of Reference Samples
Browning, Brian L.; Browning, Sharon R.
2016-01-01
We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle’s throughput was more than 100× greater than Impute2’s throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs. PMID:26748515
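As a toy illustration of the linear interpolation step mentioned above, the sketch below fills in values at ungenotyped positions from the flanking genotyped markers. For simplicity it interpolates final allele dosages, whereas Beagle v.4.1 interpolates HMM state probabilities before computing dosages, so the function and its inputs are illustrative assumptions only.

```python
import numpy as np

def interpolate_dosages(typed_pos, typed_dose, untyped_pos):
    """Linearly interpolate allele dosages at ungenotyped positions from the
    values at the flanking genotyped markers (illustrative stand-in only)."""
    typed_pos = np.asarray(typed_pos, dtype=float)
    typed_dose = np.asarray(typed_dose, dtype=float)
    return np.interp(untyped_pos, typed_pos, typed_dose)

typed_pos = [1000, 5000, 9000]   # base-pair positions of genotyped markers
typed_dose = [0.1, 1.8, 0.9]     # imputed dosages at those markers
print(interpolate_dosages(typed_pos, typed_dose, [3000, 7000]))
```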
Fractional Steps methods for transient problems on commodity computer architectures
NASA Astrophysics Data System (ADS)
Krotkiewski, M.; Dabrowski, M.; Podladchikov, Y. Y.
2008-12-01
Fractional Steps methods are suitable for modeling transient processes that are central to many geological applications. Low memory requirements and modest computational complexity facilitate calculations on high-resolution three-dimensional models. An efficient implementation of Alternating Direction Implicit/Locally One-Dimensional schemes for an Opteron-based shared memory system is presented. The memory bandwidth usage, the main bottleneck on modern computer architectures, is specifically addressed. A high efficiency of above 2 GFlops per CPU is sustained for problems of 1 billion degrees of freedom. The optimized sequential implementation of all 1D sweeps is comparable in execution time to copying the used data in memory. Scalability of the parallel implementation on up to 8 CPUs is close to perfect. Performing one timestep of the Locally One-Dimensional scheme on a system of 1000^3 unknowns on 8 CPUs takes only 11 s. We validate the LOD scheme using a computational model of an isolated inclusion subject to a constant far-field flux. Next, we study numerically the evolution of a diffusion front and the effective thermal conductivity of composites consisting of multiple inclusions and compare the results with predictions based on the differential effective medium approach. Finally, application of the developed parabolic solver is suggested for a real-world problem of fluid transport and reactions inside a reservoir.
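To make the Locally One-Dimensional idea concrete, here is a minimal NumPy/SciPy sketch of one LOD timestep for 3D diffusion on a cube with zero Dirichlet boundaries: each direction is treated implicitly in turn, and each sweep reduces to a batch of independent tridiagonal solves, which is also why the memory traffic per sweep is close to a plain copy of the data. This is an illustration only, not the Opteron-optimized implementation described in the abstract.

```python
import numpy as np
from scipy.linalg import solve_banded

def lod_step(u, alpha):
    """One Locally One-Dimensional timestep for 3D diffusion on an n^3 cube.
    alpha = D * dt / dx^2; each sweep solves all lines along one axis at once."""
    n = u.shape[0]
    ab = np.zeros((3, n))            # banded form of (I - alpha * d2/dx2)
    ab[0, 1:] = -alpha               # super-diagonal
    ab[1, :] = 1.0 + 2.0 * alpha     # main diagonal
    ab[2, :-1] = -alpha              # sub-diagonal
    for axis in range(3):
        v = np.moveaxis(u, axis, 0).reshape(n, -1)   # lines along `axis` as columns
        v = solve_banded((1, 1), ab, v)              # batch tridiagonal solve
        u = np.moveaxis(v.reshape((n, n, n)), 0, axis)
    return u

u = np.zeros((32, 32, 32))
u[16, 16, 16] = 1.0                  # point source
u = lod_step(u, alpha=0.5)
print(u.max(), u.sum())
```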
Improved look-up table method of computer-generated holograms.
Wei, Hui; Gong, Guanghong; Li, Ni
2016-11-10
Heavy computational load and vast memory requirements are major bottlenecks of computer-generated holograms (CGHs), which are promising yet challenging for three-dimensional displays. To address these problems, an improved look-up table (LUT) method suitable for arbitrarily sampled object points is proposed and implemented on a graphics processing unit (GPU); its reconstructed object quality is consistent with that of the coherent ray-trace (CRT) method. The concept of a distance factor is defined, and the distance factors are pre-computed off-line and stored in a look-up table. The results show that while reconstruction quality close to that of the CRT method is obtained, the on-line computation time is dramatically reduced compared with the LUT method on the GPU, and the memory usage is considerably lower than that of the novel-LUT method. Optical experiments are carried out to validate the effectiveness of the proposed method.
gpuSPHASE-A shared memory caching implementation for 2D SPH using CUDA
NASA Astrophysics Data System (ADS)
Winkler, Daniel; Meister, Michael; Rezavand, Massoud; Rauch, Wolfgang
2017-04-01
Smoothed particle hydrodynamics (SPH) is a meshless Lagrangian method that has been successfully applied to computational fluid dynamics (CFD), solid mechanics and many other multi-physics problems. Using the method to solve transport phenomena in process engineering requires the simulation of several days to weeks of physical time. Given the high computational demand of CFD, such simulations in 3D would need a computation time of years, so a reduction to a 2D domain is inevitable. In this paper gpuSPHASE, a new open-source 2D SPH solver implementation for graphics devices, is developed. It is optimized for simulations that must be executed at thousands of frames per second in order to complete in reasonable time. A novel caching algorithm for Compute Unified Device Architecture (CUDA) shared memory is proposed and implemented. The software is validated and the performance is evaluated for the well-established dam-break test case.
Numerical arc segmentation algorithm for a radio conference-NASARC (version 2.0) technical manual
NASA Technical Reports Server (NTRS)
Whyte, Wayne A., Jr.; Heyward, Ann O.; Ponchak, Denise S.; Spence, Rodney L.; Zuzek, John E.
1987-01-01
The information contained in the NASARC (Version 2.0) Technical Manual (NASA TM-100160) and NASARC (Version 2.0) User's Manual (NASA TM-100161) relates to the state of NASARC software development through October 16, 1987. The Technical Manual describes the Numerical Arc Segmentation Algorithm for a Radio Conference (NASARC) concept and the algorithms used to implement the concept. The User's Manual provides information on computer system considerations, installation instructions, description of input files, and program operating instructions. Significant revisions have been incorporated in the Version 2.0 software. These revisions have enhanced the modeling capabilities of the NASARC procedure while greatly reducing the computer run time and memory requirements. Array dimensions within the software have been structured to fit within the currently available 6-megabyte memory capacity of the International Frequency Registration Board (IFRB) computer facility. A piecewise approach to predetermined arc generation in NASARC (Version 2.0) allows worldwide scenarios to be accommodated within these memory constraints while at the same time effecting an overall reduction in computer run time.
Numerical Arc Segmentation Algorithm for a Radio Conference-NASARC, Version 2.0: User's Manual
NASA Technical Reports Server (NTRS)
Whyte, Wayne A., Jr.; Heyward, Ann O.; Ponchak, Denise S.; Spence, Rodney L.; Zuzek, John E.
1987-01-01
The information contained in the NASARC (Version 2.0) Technical Manual (NASA TM-100160) and the NASARC (Version 2.0) User's Manual (NASA TM-100161) relates to the state of the Numerical Arc Segmentation Algorithm for a Radio Conference (NASARC) software development through October 16, 1987. The technical manual describes the NASARC concept and the algorithms which are used to implement it. The User's Manual provides information on computer system considerations, installation instructions, description of input files, and program operation instructions. Significant revisions have been incorporated in the Version 2.0 software over prior versions. These revisions have enhanced the modeling capabilities of the NASARC procedure while greatly reducing the computer run time and memory requirements. Array dimensions within the software have been structured to fit into the currently available 6-megabyte memory capacity of the International Frequency Registration Board (IFRB) computer facility. A piecewise approach to predetermined arc generation in NASARC (Version 2.0) allows worldwide scenarios to be accommodated within these memory constraints while at the same time reducing computer run time.
Paging memory from random access memory to backing storage in a parallel computer
Archer, Charles J; Blocksome, Michael A; Inglett, Todd A; Ratterman, Joseph D; Smith, Brian E
2013-05-21
Paging memory from random access memory (`RAM`) to backing storage in a parallel computer that includes a plurality of compute nodes, including: executing a data processing application on a virtual machine operating system in a virtual machine on a first compute node; providing, by a second compute node, backing storage for the contents of RAM on the first compute node; and swapping, by the virtual machine operating system in the virtual machine on the first compute node, a page of memory from RAM on the first compute node to the backing storage on the second compute node.
NASA Astrophysics Data System (ADS)
Schmalz, Mark S.; Ritter, Gerhard X.; Caimi, Frank M.
2001-12-01
A wide variety of digital image compression transforms developed for still imaging and broadcast video transmission are unsuitable for Internet video applications due to insufficient compression ratio, poor reconstruction fidelity, or excessive computational requirements. Examples include hierarchical transforms that require all, or a large portion of, a source image to reside in memory at one time, transforms that induce significant blocking effects at operationally salient compression ratios, and algorithms that require large amounts of floating-point computation. The latter constraint holds especially for video compression by small mobile imaging devices for transmission to, and decompression on, platforms such as palmtop computers or personal digital assistants (PDAs). As Internet video requirements for frame rate and resolution increase to produce more detailed, less discontinuous motion sequences, a new class of compression transforms will be needed, especially for small memory models and displays such as those found on PDAs. In this, the third in a series of papers, we discuss the EBLAST compression transform and its application to Internet communication. Leading transforms for compression of Internet video and still imagery are reviewed and analyzed, including GIF, JPEG, AWIC (wavelet-based), wavelet packets, and SPIHT, whose performance is compared with EBLAST. Performance analysis criteria include time and space complexity and quality of the decompressed image. The latter is determined by rate-distortion data obtained from a database of realistic test images. Discussion also includes issues such as robustness of the compressed format to channel noise. EBLAST has been shown to perform superiorly to JPEG and, unlike current wavelet compression transforms, supports fast implementation on embedded processors with small memory models.
Exascale Hardware Architectures Working Group
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hemmert, S; Ang, J; Chiang, P
2011-03-15
The ASC Exascale Hardware Architecture working group is challenged to provide input on the following areas impacting the future use and usability of potential exascale computer systems: processor, memory, and interconnect architectures, as well as the power and resilience of these systems. Going forward, there are many challenging issues that will need to be addressed. First, power constraints in processor technologies will lead to steady increases in parallelism within a socket. Additionally, all cores may not be fully independent nor fully general purpose. Second, there is a clear trend toward less balanced machines, in terms of compute capability compared to memory and interconnect performance. In order to mitigate the memory issues, memory technologies will introduce 3D stacking, eventually moving on-socket and likely on-die, providing greatly increased bandwidth but unfortunately also likely providing smaller memory capacity per core. Off-socket memory, possibly in the form of non-volatile memory, will create a complex memory hierarchy. Third, communication energy will dominate the energy required to compute, such that interconnect power and bandwidth will have a significant impact. All of the above changes are driven by the need for greatly increased energy efficiency, as current technology will prove unsuitable for exascale, due to unsustainable power requirements of such a system. These changes will have the most significant impact on programming models and algorithms, but they will be felt across all layers of the machine. There is a clear need to engage all ASC working groups in planning for how to deal with technological changes of this magnitude. The primary function of the Hardware Architecture Working Group is to facilitate codesign with hardware vendors to ensure future exascale platforms are capable of efficiently supporting the ASC applications, which in turn need to meet the mission needs of the NNSA Stockpile Stewardship Program. This issue is relatively immediate, as there is only a small window of opportunity to influence hardware design for 2018 machines. Given the short timeline, a firm co-design methodology with vendors is of prime importance.
Masuda, Y; Misztal, I; Legarra, A; Tsuruta, S; Lourenco, D A L; Fragomeni, B O; Aguilar, I
2017-01-01
This paper evaluates an efficient implementation to multiply the inverse of a numerator relationship matrix for genotyped animals by a vector. The computation is required for solving mixed model equations in single-step genomic BLUP (ssGBLUP) with the preconditioned conjugate gradient (PCG). The inverse can be decomposed into sparse matrices that are blocks of the sparse inverse of a numerator relationship matrix including genotyped animals and their ancestors. The elements of these blocks were rapidly calculated with Henderson's rule and stored as sparse matrices in memory. The multiplication was implemented as a series of sparse matrix-vector multiplications. Diagonal elements of the inverse, which were required as preconditioners in PCG, were approximated with a Monte Carlo method using 1,000 samples. The efficient implementation was compared with explicit inversion using 3 data sets including about 15,000, 81,000, and 570,000 genotyped animals selected from populations with 213,000, 8.2 million, and 10.7 million pedigree animals, respectively. The explicit inversion required 1.8 GB, 49 GB, and 2,415 GB (estimated) of memory, respectively, and 42 s, 56 min, and 13.5 d (estimated), respectively, for the computations. The efficient implementation required <1 MB, 2.9 GB, and 2.3 GB of memory, respectively, and <1 sec, 3 min, and 5 min, respectively, for setting up. Only <1 sec was required for the multiplication in each PCG iteration for any of the data sets. When the equations in ssGBLUP are solved with the PCG algorithm, this multiplication is no longer a limiting factor in the computations.
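A hedged sketch of the two computational ideas in this abstract follows: applying a matrix that is defined only through sparse blocks as a series of sparse matrix-vector products, and estimating its diagonal (for the PCG preconditioner) by Monte Carlo. The block structure, the sizes, and the Hutchinson-style estimator used here are illustrative assumptions, not the exact decomposition used by the authors.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(0)
n_geno, n_anc = 2000, 5000

# Sparse blocks standing in for pieces of a pedigree-based inverse; the
# combination B11 - B12 * solve(B22, B21) is an illustrative assumption.
B11 = sp.random(n_geno, n_geno, 0.002, random_state=1) + sp.eye(n_geno) * 10
B12 = sp.random(n_geno, n_anc, 0.001, random_state=2)
B22 = sp.random(n_anc, n_anc, 0.001, random_state=3) + sp.eye(n_anc) * 10
B11, B22 = (B11 + B11.T).tocsr(), (B22 + B22.T).tocsc()
solve22 = spla.factorized(B22)        # sparse factorization, computed once and reused

def apply_inv(v):
    """Multiply the implicitly defined matrix by v using sparse operations only."""
    return B11 @ v - B12 @ solve22(B12.T @ v)

def mc_diagonal(apply, n, samples=1000):
    """Hutchinson-style Monte Carlo estimate of the diagonal, for preconditioning."""
    d = np.zeros(n)
    for _ in range(samples):
        z = rng.choice([-1.0, 1.0], size=n)
        d += z * apply(z)               # E[z * (A z)] equals diag(A) for +/-1 samples
    return d / samples

diag_estimate = mc_diagonal(apply_inv, n_geno, samples=200)
print(diag_estimate[:5])
```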
NASA Technical Reports Server (NTRS)
Eidson, T. M.; Erlebacher, G.
1994-01-01
While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem, a periodic tridiagonal solver, are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand sides (RHS) of the system of equations. Each RHS solution is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation, and the more complicated approaches, such as cyclic reduction, are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm follow the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm will be shown to directly transform a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.
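The observation that each right-hand side is independent is what lets a plain Gaussian elimination (Thomas) algorithm be vectorized across all RHS columns, which is the kernel underlying both strategies. The sketch below shows that kernel in NumPy, with a comment marking where the transpose strategy would hand blocks of columns to different processors. It handles only the ordinary (non-periodic) tridiagonal case; the periodic system of the abstract would additionally need, for example, a Sherman-Morrison correction. Function and variable names are illustrative.

```python
import numpy as np

def thomas_many_rhs(a, b, c, B):
    """Thomas algorithm for one tridiagonal system with many independent
    right-hand sides stored as the columns of B.
    a: sub-diagonal (n-1), b: diagonal (n), c: super-diagonal (n-1)."""
    n = len(b)
    B = B.astype(float).copy()
    bp = b.astype(float).copy()
    for i in range(1, n):                     # forward elimination
        m = a[i - 1] / bp[i - 1]
        bp[i] -= m * c[i - 1]
        B[i] -= m * B[i - 1]                  # vectorized over all RHS columns
    B[-1] /= bp[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        B[i] = (B[i] - c[i] * B[i + 1]) / bp[i]
    return B

# Transpose strategy in miniature: each of P processors would receive a block
# of RHS columns (e.g. B[:, p::P]) and run this same routine independently.
n, nrhs = 64, 1024
a = np.full(n - 1, -1.0); b = np.full(n, 2.0); c = np.full(n - 1, -1.0)
B = np.random.rand(n, nrhs)
X = thomas_many_rhs(a, b, c, B)
T = np.diag(b) + np.diag(a, -1) + np.diag(c, 1)
print(np.allclose(T @ X, B))
```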
GPU-accelerated computing for Lagrangian coherent structures of multi-body gravitational regimes
NASA Astrophysics Data System (ADS)
Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
2017-04-01
Based on a well-established theoretical foundation, Lagrangian Coherent Structures (LCSs) have elicited widespread research on the intrinsic structures of dynamical systems in many fields, including the field of astrodynamics. Although the application of LCSs in dynamical problems seems straightforward theoretically, its associated computational cost is prohibitive. We propose a block decomposition algorithm developed on Compute Unified Device Architecture (CUDA) platform for the computation of the LCSs of multi-body gravitational regimes. In order to take advantage of GPU's outstanding computing properties, such as Shared Memory, Constant Memory, and Zero-Copy, the algorithm utilizes a block decomposition strategy to facilitate computation of finite-time Lyapunov exponent (FTLE) fields of arbitrary size and timespan. Simulation results demonstrate that this GPU-based algorithm can satisfy double-precision accuracy requirements and greatly decrease the time needed to calculate final results, increasing speed by approximately 13 times. Additionally, this algorithm can be generalized to various large-scale computing problems, such as particle filters, constellation design, and Monte-Carlo simulation.
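For orientation, the following serial NumPy sketch computes an FTLE field, the quantity whose ridges define the LCSs discussed above, for the standard double-gyre test flow: tracers on a grid are advected to obtain the flow map, and the largest eigenvalue of the Cauchy-Green tensor of that map gives the exponent. The GPU block decomposition of the paper would tile exactly this grid so that each tile fits in shared memory; the flow field, grid sizes, and the simple Euler integrator here are illustrative choices, not the authors' multi-body gravitational setup.

```python
import numpy as np

def velocity(x, y, t, A=0.1, eps=0.25, om=2 * np.pi / 10):
    """Double-gyre test flow, a common benchmark for LCS/FTLE computations."""
    s = eps * np.sin(om * t)
    f = s * x**2 + (1 - 2 * s) * x
    dfdx = 2 * s * x + (1 - 2 * s)
    u = -np.pi * A * np.sin(np.pi * f) * np.cos(np.pi * y)
    v = np.pi * A * np.cos(np.pi * f) * np.sin(np.pi * y) * dfdx
    return u, v

def ftle(nx=201, ny=101, t0=0.0, T=10.0, steps=500):
    """FTLE field from the flow-map gradient (serial reference version)."""
    x, y = np.meshgrid(np.linspace(0, 2, nx), np.linspace(0, 1, ny))
    dt = T / steps
    px, py, t = x.copy(), y.copy(), t0
    for _ in range(steps):                    # forward-Euler advection (RK4 in practice)
        u, v = velocity(px, py, t)
        px, py, t = px + dt * u, py + dt * v, t + dt
    # Cauchy-Green tensor from finite differences of the flow map.
    dxdx = np.gradient(px, x[0], axis=1); dxdy = np.gradient(px, y[:, 0], axis=0)
    dydx = np.gradient(py, x[0], axis=1); dydy = np.gradient(py, y[:, 0], axis=0)
    c11 = dxdx**2 + dydx**2
    c12 = dxdx * dxdy + dydx * dydy
    c22 = dxdy**2 + dydy**2
    lam_max = 0.5 * (c11 + c22 + np.sqrt((c11 - c22)**2 + 4 * c12**2))
    return np.log(np.sqrt(lam_max)) / abs(T)

print(ftle(nx=101, ny=51, steps=200).max())
```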
SODR Memory Control Buffer Control ASIC
NASA Technical Reports Server (NTRS)
Hodson, Robert F.
1994-01-01
The Spacecraft Optical Disk Recorder (SODR) is a state-of-the-art mass storage system for future NASA missions requiring high transmission rates and a large-capacity storage system. This report covers the design and development of an SODR memory buffer control application-specific integrated circuit (ASIC). The memory buffer control ASIC has two primary functions: (1) buffering data to prevent loss of data during disk access times, and (2) converting data formats from a high performance parallel interface format to a small computer systems interface format. Ten 144-pin, 50 MHz CMOS ASICs were designed, fabricated, and tested to implement the memory buffer control function.
Performance of FORTRAN floating-point operations on the Flex/32 multicomputer
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1987-01-01
A series of experiments has been run to examine the floating-point performance of FORTRAN programs on the Flex/32 (Trademark) computer. The experiments are described, and the timing results are presented. The time required to execute a floating-point operation is found to vary considerably depending on a number of factors. One factor of particular interest from an algorithm design standpoint is the difference in speed between common memory accesses and local memory accesses. Common memory accesses were found to be slower, and guidelines are given for determining when it may be cost-effective to copy data from common to local memory.
Realization of reliable solid-state quantum memory for photonic polarization qubit.
Zhou, Zong-Quan; Lin, Wei-Bin; Yang, Ming; Li, Chuan-Feng; Guo, Guang-Can
2012-05-11
Faithfully storing an unknown quantum light state is essential to advanced quantum communication and distributed quantum computation applications. The required quantum memory must have high fidelity to improve the performance of a quantum network. Here we report the reversible transfer of photonic polarization states into collective atomic excitation in a compact solid-state device. The quantum memory is based on an atomic frequency comb (AFC) in rare-earth ion-doped crystals. We obtain up to 0.999 process fidelity for the storage and retrieval process of single-photon-level coherent pulse. This reliable quantum memory is a crucial step toward quantum networks based on solid-state devices.
Ong, Eng Teo; Lee, Heow Pueh; Lim, Kian Meng
2004-09-01
This article presents a fast algorithm for the efficient solution of the Helmholtz equation. The method is based on the translation theory of the multipole expansions. Here, the speedup comes from the convolution nature of the translation operators, which can be evaluated rapidly using fast Fourier transform algorithms. Also, the computations of the translation operators are accelerated by using the recursive formulas developed recently by Gumerov and Duraiswami [SIAM J. Sci. Comput. 25, 1344-1381 (2003)]. It is demonstrated that the algorithm can produce good accuracy with a relatively low order of expansion. Efficiency analyses of the algorithm reveal that it has computational complexities of O(N^a), where a ranges from 1.05 to 1.24. However, this method requires substantially more memory to store the translation operators as compared to the fast multipole method. Hence, despite its simplicity in implementation, this memory requirement issue may limit the application of this algorithm to solving very large-scale problems.
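The speedup attributed above to the convolution structure of the translation operators can be illustrated in isolation: a circular convolution evaluated directly costs O(N^2), while evaluating it through FFTs costs O(N log N) and gives the same result. The sketch below demonstrates only this generic identity, not the multipole translation operators themselves.

```python
import numpy as np

def circular_convolution_fft(a, b):
    """Evaluate a circular convolution in O(N log N) with FFTs, the same trick
    that accelerates convolution-type translation operators."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))

def circular_convolution_direct(a, b):
    """Reference O(N^2) evaluation of the same circular convolution."""
    n = len(a)
    return np.array([sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)])

rng = np.random.default_rng(4)
a, b = rng.standard_normal(256), rng.standard_normal(256)
print(np.allclose(circular_convolution_fft(a, b), circular_convolution_direct(a, b)))
```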
Parallel Implementation of MAFFT on CUDA-Enabled Graphics Hardware.
Zhu, Xiangyuan; Li, Kenli; Salah, Ahmad; Shi, Lin; Li, Keqin
2015-01-01
Multiple sequence alignment (MSA) constitutes an extremely powerful tool for many biological applications including phylogenetic tree estimation, secondary structure prediction, and critical residue identification. However, aligning large biological sequences with popular tools such as MAFFT requires long runtimes on sequential architectures. Due to the ever-increasing sizes of sequence databases, there is increasing demand to accelerate this task. In this paper, we demonstrate how graphics processing units (GPUs), powered by the compute unified device architecture (CUDA), can be used as an efficient computational platform to accelerate the MAFFT algorithm. To fully exploit the GPU's capabilities for accelerating MAFFT, we have optimized the sequence data organization to eliminate the bandwidth bottleneck of memory access, designed a memory allocation and reuse strategy to make full use of the limited memory of GPUs, proposed a new modified-run-length encoding (MRLE) scheme to reduce memory consumption, and used high-performance shared memory to speed up I/O operations. Our implementation, tested on three NVIDIA GPUs, achieves a speedup of up to 11.28 on a Tesla K20m GPU compared to the sequential MAFFT 7.015.
A decomposition approach to the design of a multiferroic memory bit
NASA Astrophysics Data System (ADS)
Acevedo, Ruben; Liang, Cheng-Yen; Carman, Gregory P.; Sepulveda, Abdon E.
2017-06-01
The objective of this paper is to present a methodology for the design of a memory bit to minimize the energy required to write data at the bit level. By straining a ferromagnetic nickel nano-dot by means of a piezoelectric substrate, its magnetization vector rotates between two stable states defined as a 1 and 0 for digital memory. The memory bit geometry, actuation mechanism and voltage control law were used as design variables. The approach used was to decompose the overall design process into simpler sub-problems whose structure can be exploited for a more efficient solution. This method minimizes the number of fully dynamic coupled finite element analyses required to converge to a near optimal design, thus decreasing the computational time for the design process. An in-plane sample design problem is presented to illustrate the advantages and flexibility of the procedure.
Low-memory iterative density fitting.
Grajciar, Lukáš
2015-07-30
A new low-memory modification of the density fitting approximation based on a combination of a continuous fast multipole method (CFMM) and a preconditioned conjugate gradient solver is presented. The iterative conjugate gradient solver uses preconditioners formed from blocks of the Coulomb metric matrix that decrease the number of iterations needed for convergence by up to one order of magnitude. The matrix-vector products needed within the iterative algorithm are calculated using CFMM, which evaluates them with only linear-scaling memory requirements. Compared with the standard density fitting implementation, up to a 15-fold reduction of the memory requirements is achieved for the most efficient preconditioner at a cost of only a 25% increase in computational time. The potential of the method is demonstrated by performing density functional theory calculations for a zeolite fragment with 2592 atoms and 121,248 auxiliary basis functions on a single 12-core CPU workstation. © 2015 Wiley Periodicals, Inc.
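A minimal sketch of the two ingredients named above, a preconditioned conjugate gradient loop and a block preconditioner built from diagonal blocks of the metric matrix, is given below. The matrix-vector product is passed in as a callable (standing in for the CFMM-evaluated product); the dense test matrix, block size, and tolerance are illustrative assumptions.

```python
import numpy as np

def pcg(apply_A, b, apply_M, tol=1e-8, maxiter=500):
    """Preconditioned conjugate gradients with the matrix-vector product and the
    preconditioner both supplied as callables."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    z = apply_M(r)
    p = z.copy()
    rz = r @ z
    for it in range(maxiter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it + 1
        z = apply_M(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

# Block-Jacobi preconditioner built from diagonal blocks of a stand-in SPD metric matrix.
rng = np.random.default_rng(1)
n, blk = 240, 24
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)
blocks = [np.linalg.inv(A[i:i + blk, i:i + blk]) for i in range(0, n, blk)]
def apply_M(r):
    return np.concatenate([Bi @ r[i:i + blk] for i, Bi in zip(range(0, n, blk), blocks)])

b = rng.standard_normal(n)
x, iters = pcg(lambda v: A @ v, b, apply_M)
print(iters, np.linalg.norm(A @ x - b))
```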
Virtual memory support for distributed computing environments using a shared data object model
NASA Astrophysics Data System (ADS)
Huang, F.; Bacon, J.; Mapp, G.
1995-12-01
Conventional storage management systems provide one interface for accessing memory segments and another for accessing secondary storage objects. This hinders application programming and affects overall system performance due to mandatory data copying and user/kernel boundary crossings, which in the microkernel case may involve context switches. Memory-mapping techniques may be used to provide programmers with a unified view of the storage system. This paper extends such techniques to support a shared data object model for distributed computing environments in which good support for coherence and synchronization is essential. The approach is based on a microkernel, typed memory objects, and integrated coherence control. A microkernel architecture is used to support multiple coherence protocols and the addition of new protocols. Memory objects are typed and applications can choose the most suitable protocols for different types of object to avoid protocol mismatch. Low-level coherence control is integrated with high-level concurrency control so that the number of messages required to maintain memory coherence is reduced and system-wide synchronization is realized without severely impacting the system performance. These features together contribute a novel approach to the support for flexible coherence under application control.
Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley
2011-05-01
Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
Intelligent holographic databases
NASA Astrophysics Data System (ADS)
Barbastathis, George
Memory is a key component of intelligence. In the human brain, physical structure and functionality jointly provide diverse memory modalities at multiple time scales. How could we engineer artificial memories with similar faculties? In this thesis, we attack both hardware and algorithmic aspects of this problem. A good part is devoted to holographic memory architectures, because they meet high capacity and parallelism requirements. We develop and fully characterize shift multiplexing, a novel storage method that simplifies disk head design for holographic disks. We develop and optimize the design of compact refreshable holographic random access memories, showing several ways that 1 Tbit can be stored holographically in volume less than 1 m3, with surface density more than 20 times higher than conventional silicon DRAM integrated circuits. To address the issue of photorefractive volatility, we further develop the two-lambda (dual wavelength) method for shift multiplexing, and combine electrical fixing with angle multiplexing to demonstrate 1,000 multiplexed fixed holograms. Finally, we propose a noise model and an information theoretic metric to optimize the imaging system of a holographic memory, in terms of storage density and error rate. Motivated by the problem of interfacing sensors and memories to a complex system with limited computational resources, we construct a computer game of Desert Survival, built as a high-dimensional non-stationary virtual environment in a competitive setting. The efficacy of episodic learning, implemented as a reinforced Nearest Neighbor scheme, and the probability of winning against a control opponent improve significantly by concentrating the algorithmic effort to the virtual desert neighborhood that emerges as most significant at any time. The generalized computational model combines the autonomous neural network and von Neumann paradigms through a compact, dynamic central representation, which contains the most salient features of the sensory inputs, fused with relevant recollections, reminiscent of the hypothesized cognitive function of awareness. The Declarative Memory is searched both by content and address, suggesting a holographic implementation. The proposed computer architecture may lead to a novel paradigm that solves 'hard' cognitive problems at low cost.
Bhatti, A Aziz
2009-12-01
This study proposes an efficient and improved model of a direct storage bidirectional memory, the improved bidirectional associative memory (IBAM), and emphasises the use of nanotechnology for efficient implementation of such large-scale neural network structures at a considerably lower cost, reduced complexity, and less area required for implementation. This memory model directly stores the X and Y associated sets of M bipolar binary vectors in the form of (M x Nx) and (M x Ny) memory matrices, requires O(N) or about 30% of interconnections with weight strength ranging between +/-1, and is computationally very efficient as compared to sequential, intraconnected and other bidirectional associative memory (BAM) models of outer-product type that require O(N^2) complex interconnections with weight strength ranging between +/-M. It is shown that it is functionally equivalent to and possesses all attributes of a BAM of outer-product type, and yet it is a structurally simple and robust, very large scale integration (VLSI), optically and nanotechnology realisable, modular and expandable neural network bidirectional associative memory model in which the addition or deletion of a pair of vectors does not require changes in the strength of interconnections of the entire memory matrix. The analysis of the retrieval process, signal-to-noise ratio, storage capacity and stability of the proposed model as well as of the traditional BAM has been carried out. Constraints on and characteristics of unipolar and bipolar binaries for improved storage and retrieval are discussed. The simulation results show that it has log_e N times higher storage capacity, superior performance, and faster convergence and retrieval time when compared to traditional sequential and intraconnected bidirectional memories.
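Because the IBAM is stated to be functionally equivalent to an outer-product BAM, the baseline it is compared against can be sketched compactly. The code below builds the classical Kosko-style weight matrix from bipolar pattern pairs and runs the bidirectional recall iteration; it exhibits the O(Nx*Ny) interconnection matrix with weights ranging over +/-M that the proposed direct-storage model avoids. Pattern sizes and the noise level are arbitrary, and recall from noisy input is only likely, not guaranteed, at these settings.

```python
import numpy as np

def outer_product_bam(X, Y):
    """Classical outer-product BAM weight matrix: W = sum_k x_k^T y_k.
    Entries range over +/-M, and all Nx*Ny interconnections are explicit."""
    return X.T @ Y                                    # shape (Nx, Ny)

def recall(W, x, steps=5):
    """Bidirectional recall: alternate x -> y -> x with bipolar thresholding."""
    sign = lambda v: np.where(v >= 0, 1, -1)
    y = sign(x @ W)
    for _ in range(steps):
        x = sign(W @ y)
        y = sign(x @ W)
    return x, y

rng = np.random.default_rng(7)
M, Nx, Ny = 3, 32, 24
X = rng.choice([-1, 1], size=(M, Nx))
Y = rng.choice([-1, 1], size=(M, Ny))
W = outer_product_bam(X, Y)

noisy = X[0] * rng.choice([1, 1, 1, 1, 1, 1, 1, -1], size=Nx)  # flip roughly 12% of bits
x_rec, y_rec = recall(W, noisy)
print("recovered stored pair 0:", np.array_equal(y_rec, Y[0]))
```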
DOE Office of Scientific and Technical Information (OSTI.GOV)
Muller, U.A.; Baumle, B.; Kohler, P.
1992-10-01
Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.
Investigation of single crystal ferrite thin films
NASA Technical Reports Server (NTRS)
Mee, J. E.; Besser, P. J.; Elkins, P. E.; Glass, H. L.; Whitcomb, E. C.
1972-01-01
Materials suitable for use in magnetic bubble domain memories were developed for aerospace applications. Practical techniques for the preparation of such materials in forms required for fabrication of computer memory devices were considered. The materials studied were epitaxial films of various compositions of the gallium-substituted yttrium gadolinium iron garnet system. The major emphasis was to determine their bubble properties and the conditions necessary for growing uncracked, high quality films.
Fog computing job scheduling optimization based on bees swarm
NASA Astrophysics Data System (ADS)
Bitam, Salim; Zeadally, Sherali; Mellouk, Abdelhamid
2018-04-01
Fog computing is a new computing architecture, composed of a set of near-user edge devices called fog nodes, which collaborate in order to perform computational services such as running applications, storing large amounts of data, and transmitting messages. Fog computing extends cloud computing by deploying digital resources at the premises of mobile users. In this new paradigm, management and operating functions, such as job scheduling, aim at providing high-performance, cost-effective services requested by mobile users and executed by fog nodes. We propose a new bio-inspired optimization approach called the Bees Life Algorithm (BLA) aimed at addressing the job scheduling problem in the fog computing environment. Our proposed approach is based on the optimized distribution of a set of tasks among all the fog computing nodes. The objective is to find an optimal tradeoff between the CPU execution time and the allocated memory required by fog computing services established by mobile users. Our empirical performance evaluation results demonstrate that our proposal outperforms traditional particle swarm optimization and genetic algorithms in terms of CPU execution time and allocated memory.
Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube
NASA Technical Reports Server (NTRS)
Joslin, Ronald D.; Zubair, Mohammad
1993-01-01
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors, nearly ideal linear speedups are achieved with nonoptimized routines; slower-than-linear speedups are achieved with optimized (machine-dependent library) routines. This slower-than-linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and because that routine itself shows less-than-ideal speedups. However, with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise, wall-normal, and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single-processor time to complete a comparable simulation; however, it is estimated that a subgrid-scale model, which reduces the required number of grid points and becomes a large-eddy simulation (PSLES), would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.
Reder, Lynne M.; Park, Heekyeong; Kieffaber, Paul D.
2009-01-01
There is a popular hypothesis that performance on implicit and explicit memory tasks reflects 2 distinct memory systems. Explicit memory is said to store those experiences that can be consciously recollected, and implicit memory is said to store experiences and affect subsequent behavior but to be unavailable to conscious awareness. Although this division based on awareness is a useful taxonomy for memory tasks, the authors review the evidence that the unconscious character of implicit memory does not necessitate that it be treated as a separate system of human memory. They also argue that some implicit and explicit memory tasks share the same memory representations and that the important distinction is whether the task (implicit or explicit) requires the formation of a new association. The authors review and critique dissociations from the behavioral, amnesia, and neuroimaging literatures that have been advanced in support of separate explicit and implicit memory systems by highlighting contradictory evidence and by illustrating how the data can be accounted for using a simple computational memory model that assumes the same memory representation for those disparate tasks. PMID:19210052
NASA Technical Reports Server (NTRS)
Peterson, Victor L.; Kim, John; Holst, Terry L.; Deiwert, George S.; Cooper, David M.; Watson, Andrew B.; Bailey, F. Ron
1992-01-01
Report evaluates supercomputer needs of five key disciplines: turbulence physics, aerodynamics, aerothermodynamics, chemistry, and mathematical modeling of human vision. Predicts these fields will require computer speed greater than 10^18 floating-point operations per second (FLOPS) and memory capacity greater than 10^15 words. Also, new parallel computer architectures and new structured numerical methods will make necessary speed and capacity available.
26 CFR 1.274-5T - Substantiation requirements (temporary).
Code of Federal Regulations, 2012 CFR
2012-04-01
... the business use of listed property, such as a computer or automobile, prepared in a computer memory... such person. (iii) Primary use of a facility. Section 274(a) (1)(B) and (2)(C) deny a deduction for any... information as shall tend to establish such primary use. Such records of use shall contain— (A) For each use...
26 CFR 1.274-5T - Substantiation requirements (temporary).
Code of Federal Regulations, 2013 CFR
2013-04-01
... the business use of listed property, such as a computer or automobile, prepared in a computer memory... such person. (iii) Primary use of a facility. Section 274(a) (1)(B) and (2)(C) deny a deduction for any... information as shall tend to establish such primary use. Such records of use shall contain— (A) For each use...
26 CFR 1.274-5T - Substantiation requirements (temporary).
Code of Federal Regulations, 2014 CFR
2014-04-01
... the business use of listed property, such as a computer or automobile, prepared in a computer memory... such person. (iii) Primary use of a facility. Section 274(a) (1)(B) and (2)(C) deny a deduction for any... information as shall tend to establish such primary use. Such records of use shall contain— (A) For each use...
Robust image retrieval from noisy inputs using lattice associative memories
NASA Astrophysics Data System (ADS)
Urcid, Gonzalo; Nieves-V., José Angel; García-A., Anmi; Valdiviezo-N., Juan Carlos
2009-02-01
Lattice associative memories, also known as morphological associative memories, are fully connected feedforward neural networks with no hidden layers, whose computation at each node is carried out with lattice algebra operations. These networks are a relatively recent development in the field of associative memories that has proven to be an alternative way to work with sets of pattern pairs for which the storage and retrieval stages use minimax algebra. Different associative memory models have been proposed to cope with the problem of pattern recall under input degradations, such as occlusions or random noise, where input patterns can be composed of binary or real valued entries. In comparison to these and other artificial neural network memories, lattice-algebra-based memories display better performance for storage and recall capability; however, the computational techniques devised to achieve that purpose require additional processing or provide partial success when inputs are presented with undetermined noise levels. Robust retrieval capability of an associative memory model is usually expressed by a high percentage of perfect recalls from non-perfect input. The procedure described here uses noise masking defined by simple lattice operations together with appropriate metrics, such as the normalized mean squared error or signal to noise ratio, to boost the recall performance of either the min or max lattice auto-associative memories. Using a single lattice associative memory, illustrative examples are given that demonstrate the enhanced retrieval of correct gray-scale image associations from inputs corrupted with random noise.
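A compact sketch of the min/max lattice auto-associative memories and their minimax-algebra recall, on which the noise-masking procedure above builds, is given below. The W memory tolerates erosive (only-downward) corruption and the M memory tolerates dilative corruption; exact recall from noisy input additionally requires the conditions of the lattice-memory theorems, so the demo only reports the residual error. Pattern sizes and noise ranges are arbitrary.

```python
import numpy as np

def lattice_memories(X):
    """Min (W) and max (M) auto-associative lattice memories for the patterns in
    the rows of X: W[i, j] = min_k (x_k[i] - x_k[j]), M[i, j] = max_k (x_k[i] - x_k[j])."""
    D = X[:, :, None] - X[:, None, :]
    return D.min(axis=0), D.max(axis=0)

def recall_W(W, x):
    """Max-plus product, tolerant of erosive (decreased-value) noise."""
    return (W + x[None, :]).max(axis=1)

def recall_M(M, x):
    """Min-plus product, tolerant of dilative (increased-value) noise."""
    return (M + x[None, :]).min(axis=1)

rng = np.random.default_rng(3)
X = rng.integers(0, 256, size=(4, 64)).astype(float)    # four stored gray-scale patterns
W, M = lattice_memories(X)

eroded = X[1] - rng.integers(0, 40, size=64)             # only-downward corruption
print(np.abs(recall_W(W, eroded) - X[1]).max())          # 0.0 when the exact-recall conditions hold
```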
Optical memories in digital computing
NASA Technical Reports Server (NTRS)
Alford, C. O.; Gaylord, T. K.
1979-01-01
High-capacity optical memories with relatively high data-transfer rates and multiport simultaneous-access capability may serve as a basis for new computer architectures. Several computer structures that might profitably use such memories are: a) a simultaneous record-access system, b) a simultaneously-shared memory computer system, and c) a parallel digital processing structure.
Efficiently modeling neural networks on massively parallel computers
NASA Technical Reports Server (NTRS)
Farber, Robert M.
1993-01-01
Neural networks are a very useful tool for analyzing and modeling complex real-world systems. Applying neural network simulations to real-world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine-grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors))). We can efficiently model very large neural networks which have many neurons and interconnects, and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent-processor interprocessor communications. This paper considers the simulation of only feed-forward neural networks, although this method is extendable to recurrent networks.
A computer program for two-particle generalized coefficients of fractional parentage
NASA Astrophysics Data System (ADS)
Deveikis, A.; Juodagalvis, A.
2008-10-01
We present a FORTRAN90 program GCFP for the calculation of the generalized coefficients of fractional parentage (generalized CFPs or GCFP). The approach is based on the observation that the multi-shell CFPs can be expressed in terms of single-shell CFPs, while the latter can be readily calculated employing a simple enumeration scheme of antisymmetric A-particle states and an efficient method of construction of the idempotent matrix eigenvectors. The program provides fast calculation of GCFPs for a given particle number and produces results possessing numerical uncertainties below the desired tolerance. A single j-shell is defined by four quantum numbers, (e,l,j,t). A supplemental C++ program parGCFP allows calculation to be done in batches and/or in parallel.
Program summary
Program title: GCFP, parGCFP
Catalogue identifier: AEBI_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEBI_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 17 199
No. of bytes in distributed program, including test data, etc.: 88 658
Distribution format: tar.gz
Programming language: FORTRAN 77/90 (GCFP), C++ (parGCFP)
Computer: Any computer with suitable compilers. The program GCFP requires a FORTRAN 77/90 compiler. The auxiliary program parGCFP requires a GNU-C++ compatible compiler, while its parallel version additionally requires MPI-1 standard libraries.
Operating system: Linux (Ubuntu, Scientific) (all programs), also checked on Windows XP (GCFP, serial version of parGCFP)
RAM: The memory demand depends on the computation and output mode. If this mode is not 4, the program GCFP demands the following amounts of memory on a computer with a Linux operating system. It requires around 2 MB of RAM for the A=12 system at E⩽2. Computation of the A=50 particle system requires around 60 MB of RAM at E=0 and ~70 MB at E=2 (note, however, that the calculation of this system will take a very long time). If the computation and output mode is set to 4, the memory demands of GCFP are significantly larger. Calculation of GCFPs of the A=12 system at E=1 requires 145 MB. The program parGCFP requires an additional 2.5 and 4.5 MB of memory for the serial and parallel version, respectively.
Classification: 17.18
Nature of problem: The program GCFP generates a list of two-particle coefficients of fractional parentage for several j-shells with isospin.
Solution method: The method is based on the observation that multishell coefficients of fractional parentage can be expressed in terms of single-shell CFPs [1]. The latter are calculated using the algorithm [2,3] for a spectral decomposition of an antisymmetrization operator matrix Y. The coefficients of fractional parentage are those eigenvectors of the antisymmetrization operator matrix Y that correspond to unit eigenvalues. A computer code for these coefficients is available [4]. The program GCFP offers computation of two-particle multishell coefficients of fractional parentage. The program parGCFP allows a batch calculation using one input file. Sets of GCFPs are independent and can be calculated in parallel.
Restrictions: A<86 when E=0 (due to the memory constraints); small numbers of particles allow significantly higher excitations, though the shell with j⩾11/2 cannot get full (this is an implementation constraint).
Unusual features: Using the program GCFP it is possible to determine allowed particle configurations without the GCFP computation. The GCFPs can be calculated either for all particle configurations at once or for a specified particle configuration. The values of GCFPs can be printed out with a complete specification either in one file or with the parent and daughter configurations printed in separate files. The latter output mode requires additional time and RAM memory. It is possible to restrict the (J,T) values of the considered particle configurations. (Here J is the total angular momentum and T is the total isospin of the system.) The program parGCFP produces several result files, the number of which equals the number of particle configurations. To work correctly, the program GCFP needs to be compiled to read parameters from the standard input (the default setting).
Running time: It depends on the size of the problem. The minimum time is required if the computation and output mode (CompMode) is not 4, but the resulting file is larger. A system with A=12 particles at E=0 (all 9411 GCFPs) took around 1 sec on a Pentium4 2.8 GHz processor with 1 MB L2 cache. The program required about 14 min to calculate all 1.3×10 GCFPs of E=1. The time for all 5.5×10 GCFPs of E=2 was about 53 hours. For this number of particles, the calculation time of both E=0 and E=1 with CompMode = 1 and 4 is nearly the same, when no other processes are running. The case of E=2 could not be calculated with CompMode = 4, because the RAM memory was insufficient. In general, the latter CompMode requires a longer computation time, although the resulting files are smaller in size. The program parGCFP adds virtually no time overhead. Its parallel version speeds up the calculation. However, the results need to be collected from the several files created for each configuration.
References:
[1] J. Levinsonas, Works of Lithuanian SSR Academy of Sciences 4 (1957) 17.
[2] A. Deveikis, A. Bončkus, R. Kalinauskas, Lithuanian Phys. J. 41 (2001) 3.
[3] A. Deveikis, R.K. Kalinauskas, B.R. Barrett, Ann. Phys. 296 (2002) 287.
[4] A. Deveikis, Comput. Phys. Comm. 173 (2005) 186. (CPC Catalogue ID. ADWI_v1_0)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arumugam, Kamesh
Efficient parallel implementation of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. This requires exploiting the data-parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of these applications employ irregular algorithms which exhibit data-dependent control-flow and irregular memory accesses. Furthermore, these applications are often iterative with dependencies between steps, making them hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged particle beam dynamics is one such application, where the distribution of work and the memory access pattern at each time step are irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between different processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications has focused on optimizing the irregular, data-dependent memory accesses and control-flow during a single step of the application, independent of the other steps, with the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-flow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps. In this dissertation, we present novel machine learning based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged particle beam dynamics as a motivating example throughout the dissertation to present our new approach, though it should be equally applicable to a wide range of irregular applications. The machine learning approach presented here uses predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation to anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution which improve the performance of the application at a future time step based on the observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver good aggregate performance.
We used these optimization techniques and the anticipation strategy to design a cache-aware, memory-efficient parallel algorithm that addresses the irregularities in the parallel implementation of charged particle beam dynamics simulation on different HPC architectures. Experimental results using a diverse mix of HPC architectures show that our anticipation-based approach is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and improving resource utilization.
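As a rough, hypothetical illustration of the general idea summarized above (forecasting the next time step's irregular workload from earlier steps and using the forecast to pre-balance work), the Python sketch below fits a per-block linear trend and then assigns blocks to workers greedily. The block/worker granularity, the trend model, and all names are assumptions made for the example, not the dissertation's actual machine learning pipeline.

```python
# Sketch only: forecast next-step per-block work from history, then load-balance.
import numpy as np

def forecast_next_step(history):
    """history: (steps, blocks) observed work per block; returns a per-block forecast."""
    steps = history.shape[0]
    t = np.arange(steps, dtype=float)
    A = np.vstack([t, np.ones(steps)]).T                 # columns: slope, intercept
    coef, *_ = np.linalg.lstsq(A, history, rcond=None)   # shape (2, blocks)
    return coef[0] * steps + coef[1]                     # extrapolate one step ahead

def balance(forecast, n_workers):
    """Greedy longest-processing-time assignment of blocks to workers."""
    order = np.argsort(forecast)[::-1]
    loads = np.zeros(n_workers)
    assign = {}
    for b in order:
        w = int(np.argmin(loads))
        loads[w] += forecast[b]
        assign[int(b)] = w
    return assign, loads

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.uniform(1, 10, size=64)                   # per-block baseline work
    hist = np.array([base * (1 + 0.05 * s) + rng.normal(0, 0.2, 64) for s in range(8)])
    pred = forecast_next_step(hist)
    _, loads = balance(pred, n_workers=4)
    print("predicted per-worker load:", np.round(loads, 1))
```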
Inverse halftoning via robust nonlinear filtering
NASA Astrophysics Data System (ADS)
Shen, Mei-Yin; Kuo, C.-C. Jay
1999-10-01
A new blind inverse halftoning algorithm based on a nonlinear filtering technique of low computational complexity and low memory requirement is proposed in this research. It is called blind since we do not require the knowledge of the halftone kernel. The proposed scheme performs nonlinear filtering in conjunction with edge enhancement to improve the quality of an inverse halftoned image. Distinct features of the proposed approach include: efficiently smoothing halftone patterns in large homogeneous areas, additional edge enhancement capability to recover the edge quality and an excellent PSNR performance with only local integer operations and a small memory buffer.
Memory Benchmarks for SMP-Based High Performance Parallel Computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoo, A B; de Supinski, B; Mueller, F
2001-11-20
As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominate the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even more complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.
Reducing power consumption during execution of an application on a plurality of compute nodes
Archer, Charles J.; Blocksome, Michael A.; Peters, Amanda E.; Ratterman, Joseph D.; Smith, Brian E.
2013-09-10
Methods, apparatus, and products are disclosed for reducing power consumption during execution of an application on a plurality of compute nodes that include: powering up, during compute node initialization, only a portion of computer memory of the compute node, including configuring an operating system for the compute node in the powered up portion of computer memory; receiving, by the operating system, an instruction to load an application for execution; allocating, by the operating system, additional portions of computer memory to the application for use during execution; powering up the additional portions of computer memory allocated for use by the application during execution; and loading, by the operating system, the application into the powered up additional portions of computer memory.
Resilient and Robust High Performance Computing Platforms for Scientific Computing Integrity
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Yier
As technology advances, computer systems are subject to increasingly sophisticated cyber-attacks that compromise both their security and integrity. High performance computing platforms used in commercial and scientific applications involving sensitive, or even classified data, are frequently targeted by powerful adversaries. This situation is made worse by a lack of fundamental security solutions that both perform efficiently and are effective at preventing threats. Current security solutions fail to address the threat landscape and ensure the integrity of sensitive data. As challenges rise, both private and public sectors will require robust technologies to protect their computing infrastructure. The research outcomes from this project try to address all these challenges. For example, we present LAZARUS, a novel technique to harden kernel Address Space Layout Randomization (KASLR) against paging-based side-channel attacks. In particular, our scheme allows for fine-grained protection of the virtual memory mappings that implement the randomization. We demonstrate the effectiveness of our approach by hardening a recent Linux kernel with LAZARUS, mitigating all of the previously presented side-channel attacks on KASLR. Our extensive evaluation shows that LAZARUS incurs only 0.943% overhead for standard benchmarks, and is therefore highly practical. We also introduced HA2lloc, a hardware-assisted allocator that is capable of leveraging an extended memory management unit to detect memory errors in the heap. We also perform testing using HA2lloc in a simulation environment and find that the approach is capable of preventing common memory vulnerabilities.
Yang, Tzuhsiung; Berry, John F
2018-06-04
The computation of nuclear second derivatives of energy, or the nuclear Hessian, is an essential routine in quantum chemical investigations of ground and transition states, thermodynamic calculations, and molecular vibrations. Analytic nuclear Hessian computations require the resolution of costly coupled-perturbed self-consistent field (CP-SCF) equations, while numerical differentiation of analytic first derivatives has an unfavorable 6N (N = number of atoms) prefactor. Herein, we present a new method in which grid computing is used to accelerate and/or enable the evaluation of the nuclear Hessian via numerical differentiation: NUMFREQ@Grid. Nuclear Hessians were successfully evaluated by NUMFREQ@Grid at the DFT level as well as using RIJCOSX-ZORA-MP2 or RIJCOSX-ZORA-B2PLYP for a set of linear polyacenes with systematically increasing size. For the larger members of this group, NUMFREQ@Grid was found to outperform the wall clock time of analytic Hessian evaluation; at the MP2 or B2PLYP levels, these Hessians cannot even be evaluated analytically. We also evaluated a 156-atom catalytically relevant open-shell transition metal complex and found that NUMFREQ@Grid is faster (7.7 times shorter wall clock time) and less demanding (4.4 times less memory requirement) than an analytic Hessian. Capitalizing on the capabilities of parallel grid computing, NUMFREQ@Grid can outperform analytic methods in terms of wall time, memory requirements, and treatable system size. The NUMFREQ@Grid method presented herein demonstrates how grid computing can be used to facilitate embarrassingly parallel computational procedures and is a pioneer for future implementations.
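The central-difference formula behind this kind of numerical Hessian is standard: H_ij ≈ [g_i(x + h e_j) − g_i(x − h e_j)] / (2h), where g is the analytic gradient. The sketch below is only a generic illustration of how the 6N displaced-gradient evaluations can be farmed out in parallel; the `gradient` callable, the step size, and the use of a local process pool are assumptions for the example and are not the NUMFREQ@Grid implementation.

```python
# Sketch: numerical Hessian from central differences of gradients, evaluated in parallel.
import numpy as np
from multiprocessing import Pool

def displaced_geometries(x, h):
    geoms = []
    for j in range(x.size):
        for sign in (+1.0, -1.0):
            xd = x.copy()
            xd[j] += sign * h
            geoms.append(xd)
    return geoms

def numerical_hessian(gradient, x, h=5e-3, processes=4):
    x = np.asarray(x, dtype=float)
    with Pool(processes) as pool:                       # embarrassingly parallel gradients
        grads = pool.map(gradient, displaced_geometries(x, h))
    n = x.size
    H = np.empty((n, n))
    for j in range(n):
        g_plus, g_minus = np.asarray(grads[2 * j]), np.asarray(grads[2 * j + 1])
        H[:, j] = (g_plus - g_minus) / (2.0 * h)        # central difference in coordinate j
    return 0.5 * (H + H.T)                              # symmetrize away numerical noise

def harmonic_gradient(x):                               # toy stand-in for a QM gradient
    k = np.arange(1, x.size + 1, dtype=float)
    return k * x

if __name__ == "__main__":
    print(np.round(numerical_hessian(harmonic_gradient, np.zeros(6)), 3))
```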
Two-dimensional Euler and Navier-Stokes Time accurate simulations of fan rotor flows
NASA Technical Reports Server (NTRS)
Boretti, A. A.
1990-01-01
Two numerical methods are presented which describe the unsteady flow field in the blade-to-blade plane of an axial fan rotor. These methods solve the compressible, time-dependent, Euler and the compressible, turbulent, time-dependent, Navier-Stokes conservation equations for mass, momentum, and energy. The Navier-Stokes equations are written in Favre-averaged form and are closed with an approximate two-equation turbulence model with low Reynolds number and compressibility effects included. The unsteady aerodynamic component is obtained by superposing inflow or outflow unsteadiness on the steady conditions through time-dependent boundary conditions. The integration in space is performed by using a finite volume scheme, and the integration in time is performed by using k-stage Runge-Kutta schemes, k = 2,5. The numerical integration algorithm allows the reduction of the computational cost of an unsteady simulation involving high frequency disturbances in both CPU time and memory requirements. Less than 200 sec of CPU time are required to advance the Euler equations in a computational grid made up of about 2000 grid points during 10,000 time steps on a CRAY Y-MP computer, with a required memory of less than 0.3 megawords.
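The k-stage Runge-Kutta time stepping mentioned above is commonly written in the low-storage multistage form u^(k) = u^n + α_k Δt R(u^(k−1)). The sketch below uses the familiar coefficients α_k = 1/(K − k + 1), which are an assumption here rather than necessarily the paper's exact scheme, and a toy upwind residual in place of the Euler/Navier-Stokes fluxes.

```python
# Sketch of a low-storage K-stage Runge-Kutta update for u' = R(u).
import numpy as np

def rk_multistage(residual, u, dt, stages=5):
    u0 = u.copy()
    for k in range(1, stages + 1):
        alpha = 1.0 / (stages - k + 1)
        u = u0 + alpha * dt * residual(u)     # every stage restarts from u^n
    return u

if __name__ == "__main__":
    # toy residual: linear advection on a periodic grid, first-order upwind
    n, c = 100, 1.0
    dx = 1.0 / n
    x = np.linspace(0.0, 1.0, n, endpoint=False)
    u = np.exp(-200 * (x - 0.3) ** 2)
    residual = lambda v: -c * (v - np.roll(v, 1)) / dx
    for _ in range(50):
        u = rk_multistage(residual, u, dt=0.5 * dx / c, stages=5)
    print("total mass (conserved):", round(float(u.sum() * dx), 6))
```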
Personal digital assistant applications for the healthcare provider.
Keplar, Kristine E; Urbanski, Christopher J
2003-02-01
To review some common medical applications available for personal digital assistants (PDAs), with brief discussion of the different PDA operating systems and memory requirements. Key search terms included handheld, PDA, personal digital assistants, and medical applications. The literature was accessed through MEDLINE (1999-August 2002). Other information was obtained through secondary sources such as Web sites describing common PDAs. Medical applications available on PDAs are numerous and include general drug references, specialized drug references (e.g., pediatrics, geriatrics, cardiology, infectious disease), diagnostic guides, medical calculators, herbal medication references, nursing references, toxicology references, and patient tracking databases. Costs and memory requirements for these programs can vary; consequently, the healthcare provider must limit the medication applications that are placed on the handheld computer. This article attempts to systematically describe the common medical applications available for the handheld computer along with cost, memory and download requirements, and Web site information. This review found many excellent PDA drug information applications offering many features which will aid the healthcare provider. Very likely, after using these PDA applications, the healthcare provider will find them indispensable, as their multifunctional capabilities can save time, improve accuracy, and allow for general business procedures as well as serving as a quick reference tool. To ignore the benefits of this technology might be a step backward.
Li, Guoqi; Deng, Lei; Wang, Dong; Wang, Wei; Zeng, Fei; Zhang, Ziyang; Li, Huanglong; Song, Sen; Pei, Jing; Shi, Luping
2016-01-01
Chunking refers to a phenomenon whereby individuals group items together when performing a memory task to improve the performance of sequential memory. In this work, we build a bio-plausible hierarchical chunking of sequential memory (HCSM) model to explain why such improvement happens. We address this issue by linking hierarchical chunking with synaptic plasticity and neuromorphic engineering. We uncover that a chunking mechanism reduces the requirements of synaptic plasticity since it allows applying synapses with narrow dynamic range and low precision to perform a memory task. We validate a hardware version of the model through simulation, based on measured memristor behavior with narrow dynamic range in neuromorphic circuits, which reveals how chunking works and what role it plays in encoding sequential memory. Our work deepens the understanding of sequential memory and enables incorporating it for the investigation of the brain-inspired computing on neuromorphic architecture. PMID:28066223
Coherent concepts are computed in the anterior temporal lobes.
Lambon Ralph, Matthew A; Sage, Karen; Jones, Roy W; Mayberry, Emily J
2010-02-09
In his Philosophical Investigations, Wittgenstein famously noted that the formation of semantic representations requires more than a simple combination of verbal and nonverbal features to generate conceptually based similarities and differences. Classical and contemporary neuroscience has tended to focus upon how different neocortical regions contribute to conceptualization through the summation of modality-specific information. The additional yet critical step of computing coherent concepts has received little attention. Some computational models of semantic memory are able to generate such concepts by the addition of modality-invariant information coded in a multidimensional semantic space. By studying patients with semantic dementia, we demonstrate that this aspect of semantic memory becomes compromised following atrophy of the anterior temporal lobes and, as a result, the patients become increasingly influenced by superficial rather than conceptual similarities.
Levy, Scott; Ferreira, Kurt B.; Bridges, Patrick G.; ...
2014-12-09
Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity uncovered, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
Unconditional room-temperature quantum memory
NASA Astrophysics Data System (ADS)
Hosseini, M.; Campbell, G.; Sparkes, B. M.; Lam, P. K.; Buchler, B. C.
2011-10-01
Just as classical information systems require buffers and memory, the same is true for quantum information systems. The potential that optical quantum information processing holds for revolutionizing computation and communication is therefore driving significant research into developing optical quantum memory. A practical optical quantum memory must be able to store and recall quantum states on demand with high efficiency and low noise. Ideally, the platform for the memory would also be simple and inexpensive. Here, we present a complete tomographic reconstruction of quantum states that have been stored in the ground states of rubidium in a vapour cell operating at around 80°C. Without conditional measurements, we show recall fidelity up to 98% for coherent pulses containing around one photon. To unambiguously verify that our memory beats the quantum no-cloning limit we employ state-independent verification using conditional variance and signal-transfer coefficients.
The role of early visual cortex in visual short-term memory and visual attention.
Offen, Shani; Schluppeck, Denis; Heeger, David J
2009-06-01
We measured cortical activity with functional magnetic resonance imaging to probe the involvement of early visual cortex in visual short-term memory and visual attention. In four experimental tasks, human subjects viewed two visual stimuli separated by a variable delay period. The tasks placed differential demands on short-term memory and attention, but the stimuli were visually identical until after the delay period. Early visual cortex exhibited sustained responses throughout the delay when subjects performed attention-demanding tasks, but delay-period activity was not distinguishable from zero when subjects performed a task that required short-term memory. This dissociation reveals different computational mechanisms underlying the two processes.
Using a Cray Y-MP as an array processor for a RISC Workstation
NASA Technical Reports Server (NTRS)
Lamaster, Hugh; Rogallo, Sarah J.
1992-01-01
As microprocessors increase in power, the economics of centralized computing has changed dramatically. At the beginning of the 1980's, mainframes and supercomputers were often considered to be cost-effective machines for scalar computing. Today, microprocessor-based RISC (reduced-instruction-set computer) systems have displaced many uses of mainframes and supercomputers. Supercomputers are still cost competitive when processing jobs that require both large memory size and high memory bandwidth. One such application is array processing. Certain numerical operations are appropriate to use in a Remote Procedure Call (RPC)-based environment. Matrix multiplication is an example of an operation that can have a sufficient number of arithmetic operations to amortize the cost of an RPC call. An experiment which demonstrates that matrix multiplication can be executed remotely on a large system to speed the execution over that experienced on a workstation is described.
Real-time depth processing for embedded platforms
NASA Astrophysics Data System (ADS)
Rahnama, Oscar; Makarov, Aleksej; Torr, Philip
2017-05-01
Obtaining depth information of a scene is an important requirement in many computer-vision and robotics applications. For embedded platforms, passive stereo systems have many advantages over their active counterparts (i.e. LiDAR, Infrared). They are power efficient, cheap, robust to lighting conditions and inherently synchronized to the RGB images of the scene. However, stereo depth estimation is a computationally expensive task that operates over large amounts of data. For embedded applications which are often constrained by power consumption, obtaining accurate results in real-time is a challenge. We demonstrate a computationally and memory efficient implementation of a stereo block-matching algorithm in FPGA. The computational core achieves a throughput of 577 fps at standard VGA resolution whilst consuming less than 3 Watts of power. The data is processed using an in-stream approach that minimizes memory-access bottlenecks and best matches the raster scan readout of modern digital image sensors.
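As a hedged CPU reference for the block-matching idea (not the FPGA design itself), the sketch below computes a winner-take-all disparity map by minimizing the block-wise sum of absolute differences over a disparity range. The window size, disparity range, and the use of SciPy's uniform filter for the box average are choices made for the example.

```python
# Sketch: basic stereo block matching (SAD) with winner-take-all disparity selection.
import numpy as np
from scipy.ndimage import uniform_filter

def block_match(left, right, max_disp=16, radius=3):
    """left/right: float grayscale images; returns a per-pixel disparity map."""
    h, w = left.shape
    size = 2 * radius + 1
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        diff = np.abs(left[:, d:] - right[:, :w - d])       # shift right image by d
        cost[d, :, d:] = uniform_filter(diff, size=size)     # mean |L-R| over the block
    return cost.argmin(axis=0)                               # cheapest disparity wins

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    right = rng.random((60, 80))
    true_d = 5
    left = np.roll(right, true_d, axis=1)                    # synthetic 5-pixel shift
    disp = block_match(left, right)
    print("median recovered disparity:", int(np.median(disp[:, true_d:])))
```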
Genotype Imputation with Millions of Reference Samples.
Browning, Brian L; Browning, Sharon R
2016-01-07
We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle's throughput was more than 100× greater than Impute2's throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs. Copyright © 2016 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
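A minimal sketch of the interpolation idea mentioned above, under the simplifying assumption that per-marker state probabilities are already available at the genotyped positions: probabilities at ungenotyped positions are linearly interpolated between the flanking genotyped markers. This is an illustration of the concept only, not Beagle's implementation.

```python
# Sketch: linear interpolation of per-state probabilities at ungenotyped positions.
import numpy as np

def interpolate_state_probs(genotyped_pos, state_probs, ungenotyped_pos):
    """state_probs: (n_genotyped, n_states); returns (n_ungenotyped, n_states)."""
    genotyped_pos = np.asarray(genotyped_pos, dtype=float)
    out = np.empty((len(ungenotyped_pos), state_probs.shape[1]))
    for s in range(state_probs.shape[1]):
        out[:, s] = np.interp(ungenotyped_pos, genotyped_pos, state_probs[:, s])
    return out

if __name__ == "__main__":
    gpos = [100.0, 500.0, 900.0]                       # genotyped marker positions
    probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])  # two reference states
    print(interpolate_state_probs(gpos, probs, [300.0, 700.0]))
```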
50 CFR 680.23 - Equipment and operational requirements.
Code of Federal Regulations, 2010 CFR
2010-10-01
... stored in the scale computer memory is replaced. Scale weights must not be adjusted by the scale operator... for an inspection to NMFS, Alaska Region. An inspection must be requested no less than 10 working days...
Confessions of a robot lobotomist
NASA Technical Reports Server (NTRS)
Gottshall, R. Marc
1994-01-01
Since its inception, numerically controlled (NC) machining methods have been used throughout the aerospace industry to mill, drill, and turn complex shapes by sequentially stepping through motion programs. However, the recent demand for more precision, faster feeds, exotic sensors, and branching execution have existing computer numerical control (CNC) and distributed numerical control (DNC) systems running at maximum controller capacity. Typical disadvantages of current CNC's include fixed memory capacities, limited communication ports, and the use of multiple control languages. The need to tailor CNC's to meet specific applications, whether it be expanded memory, additional communications, or integrated vision, often requires replacing the original controller supplied with the commercial machine tool with a more powerful and capable system. This paper briefly describes the process and equipment requirements for new controllers and their evolutionary implementation in an aerospace environment. The process of controller retrofit with currently available machines is examined, along with several case studies and their computational and architectural implications.
The potential of multi-port optical memories in digital computing
NASA Technical Reports Server (NTRS)
Alford, C. O.; Gaylord, T. K.
1975-01-01
A high-capacity memory with a relatively high data transfer rate and multi-port simultaneous access capability may serve as the basis for new computer architectures. The implementation of a multi-port optical memory is discussed. Several computer structures are presented that might profitably use such a memory. These structures include (1) a simultaneous record access system, (2) a simultaneously shared memory computer system, and (3) a parallel digital processing structure.
Optical read/write memory system components
NASA Technical Reports Server (NTRS)
Kozma, A.
1972-01-01
The optical components of a breadboard holographic read/write memory system have been fabricated and the parameters specified of the major system components: (1) a laser system; (2) an x-y beam deflector; (3) a block data composer; (4) the read/write memory material; (5) an output detector array; and (6) the electronics to drive, synchronize, and control all system components. The objectives of the investigation were divided into three concurrent phases: (1) to supply and fabricate the major components according to the previously established specifications; (2) to prepare computer programs to simulate the entire holographic memory system so that a designer can balance the requirements on the various components; and (3) to conduct a development program to optimize the combined recording and reconstruction process of the high density holographic memory system.
Scalable Quantum Networks for Distributed Computing and Sensing
2016-04-01
AFRL-AFOSR-UK-TR-2016-0007. Ian Walmsley, THE UNIVERSITY OF OXFORD. Final Report 04/01. ...probabilistic measurement, so we developed quantum memories and guided-wave implementations of same, demonstrating controlled delay of a heralded single... Second, fundamental scalability requires a method to synchronize protocols based on quantum measurements, which are inherently probabilistic. To meet...
Proteinortho: Detection of (Co-)orthologs in large-scale analysis
2011-01-01
Background: Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practice. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases. Results: The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes. Conclusions: Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware. PMID:21526987
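A simplified sketch of the reciprocal-best-hit idea underlying the extended heuristic mentioned above (this is not Proteinortho's actual algorithm): two proteins are called orthologs when each is the other's best-scoring alignment partner across the two genomes. Plain dictionaries stand in for alignment scores here.

```python
# Sketch: reciprocal best hits from two directed score tables.
def best_hits(scores):
    """scores: {(query, subject): score} -> {query: best-scoring subject}"""
    best = {}
    for (q, s), sc in scores.items():
        if q not in best or sc > best[q][1]:
            best[q] = (s, sc)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(scores_ab, scores_ba):
    best_ab, best_ba = best_hits(scores_ab), best_hits(scores_ba)
    return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]

if __name__ == "__main__":
    ab = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 310}
    ba = {("b1", "a1"): 240, ("b2", "a2"): 300, ("b2", "a1"): 80}
    print(reciprocal_best_hits(ab, ba))    # [('a1', 'b1'), ('a2', 'b2')]
```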
Run-time scheduling and execution of loops on message passing machines
NASA Technical Reports Server (NTRS)
Crowley, Kay; Saltz, Joel; Mirchandaney, Ravi; Berryman, Harry
1989-01-01
Sparse system solvers and general purpose codes for solving partial differential equations are examples of the many types of problems whose irregularity can result in poor performance on distributed memory machines. Often, the data structures used in these problems are very flexible. Crucial details concerning loop dependences are encoded in these structures rather than being explicitly represented in the program. Good methods for parallelizing and partitioning these types of problems require assignment of computations in rather arbitrary ways. Naive implementations of programs on distributed memory machines requiring general loop partitions can be extremely inefficient. Instead, the scheduling mechanism needs to capture the data reference patterns of the loops in order to partition the problem. First, the indices assigned to each processor must be locally numbered. Next, it is necessary to precompute what information is needed by each processor at various points in the computation. The precomputed information is then used to generate an execution template designed to carry out the computation, communication, and partitioning of data, in an optimized manner. The design is presented for a general preprocessor and schedule executer, the structures of which do not vary, even though the details of the computation and of the type of information are problem dependent.
Run-time scheduling and execution of loops on message passing machines
NASA Technical Reports Server (NTRS)
Saltz, Joel; Crowley, Kathleen; Mirchandaney, Ravi; Berryman, Harry
1990-01-01
Sparse system solvers and general purpose codes for solving partial differential equations are examples of the many types of problems whose irregularity can result in poor performance on distributed memory machines. Often, the data structures used in these problems are very flexible. Crucial details concerning loop dependences are encoded in these structures rather than being explicitly represented in the program. Good methods for parallelizing and partitioning these types of problems require assignment of computations in rather arbitrary ways. Naive implementations of programs on distributed memory machines requiring general loop partitions can be extremely inefficient. Instead, the scheduling mechanism needs to capture the data reference patterns of the loops in order to partition the problem. First, the indices assigned to each processor must be locally numbered. Next, it is necessary to precompute what information is needed by each processor at various points in the computation. The precomputed information is then used to generate an execution template designed to carry out the computation, communication, and partitioning of data, in an optimized manner. The design is presented for a general preprocessor and schedule executer, the structures of which do not vary, even though the details of the computation and of the type of information are problem dependent.
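The inspector/executor pattern described in these two records can be illustrated with a small single-process sketch: an "inspector" pass over the loop's index array records which referenced elements live off-processor and assigns each a buffer slot, and the "executor" performs one bulk gather before running the loop on locally buffered data. The block distribution, the fetch callback, and all names below are assumptions for the example.

```python
# Sketch: build a gather schedule from an irregular index array, then execute locally.
import numpy as np

def inspector(index_array, owner, my_rank):
    """Return the sorted off-processor indices and a slot map into a ghost buffer."""
    off_proc = sorted({int(i) for i in index_array if owner[i] != my_rank})
    slot = {g: k for k, g in enumerate(off_proc)}
    return off_proc, slot

def executor(x_local, local_offset, index_array, off_proc, slot, fetch):
    ghost = np.array([fetch(g) for g in off_proc])     # one bulk "communication" step
    def value(g):
        if local_offset <= g < local_offset + len(x_local):
            return x_local[g - local_offset]
        return ghost[slot[g]]
    return np.array([value(int(g)) for g in index_array])  # irregular gather, now local

if __name__ == "__main__":
    n, nproc, rank = 16, 2, 0
    owner = np.repeat(np.arange(nproc), n // nproc)    # block distribution of indices
    x = np.arange(n, dtype=float) * 10                 # global array (demo fetch source)
    idx = np.array([0, 3, 9, 12, 2, 15])               # irregular loop references
    off, slot = inspector(idx, owner, rank)
    y = executor(x[:n // nproc], 0, idx, off, slot, fetch=lambda g: x[g])
    print("gathered:", y, "remote indices:", off)
```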
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets
Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De; ...
2017-01-28
Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition techniques at light sources is Computed Tomography, which can generate tens of GB/s depending on the x-ray range. A large-scale tomographic dataset, such as a mouse brain, may require hours of computation time with a medium-size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for the implementation and parallelization of iterative tomographic reconstruction algorithms. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called a replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over a single-core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De
Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition techniques at light sources is Computed Tomography, which can generate tens of GB/s depending on the x-ray range. A large-scale tomographic dataset, such as a mouse brain, may require hours of computation time with a medium-size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for the implementation and parallelization of iterative tomographic reconstruction algorithms. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called a replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over a single-core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.
Multinode reconfigurable pipeline computer
NASA Technical Reports Server (NTRS)
Nosenchuck, Daniel M. (Inventor); Littman, Michael G. (Inventor)
1989-01-01
A multinode parallel-processing computer is made up of a plurality of interconnected, large capacity nodes, each including a reconfigurable pipeline of functional units such as Integer Arithmetic Logic Processors, Floating Point Arithmetic Processors, Special Purpose Processors, etc. The reconfigurable pipeline of each node is connected to a multiplane memory by a Memory-ALU switch NETwork (MASNET). The reconfigurable pipeline includes three (3) basic substructures formed from functional units which have been found to be sufficient to perform the bulk of all calculations. The MASNET controls the flow of signals from the memory planes to the reconfigurable pipeline and vice versa. The nodes are connectable together by an internode data router (hyperspace router) so as to form a hypercube configuration. The capability of the nodes to conditionally configure the pipeline at each tick of the clock, without requiring a pipeline flush, permits many powerful algorithms to be implemented directly.
[Method of file sorting for mini- and microcomputers].
Chau, N; Legras, B; Benamghar, L; Martin, J
1983-05-01
The authors describe a new sorting method of files which belongs to the class of direct-addressing sorting methods. It makes use of a variant of the classical technique of 'virtual memory'. It is particularly well suited to mini- and micro-computers which have a small core memory (32 K words, for example) and are fitted with a direct-access peripheral device, such as a disc unit. When the file to be sorted is medium-sized (some thousand records), the running of the program essentially occurs inside the core memory and consequently, the method becomes very fast. This is very important because most medical files handled in our laboratory are in this category. However, the method is also suitable for big computers and large files; its implementation is easy. It does not require any magnetic tape unit, and it seems to us to be one of the fastest methods available.
NASA Astrophysics Data System (ADS)
Onizawa, Naoya; Tamakoshi, Akira; Hanyu, Takahiro
2017-08-01
In this paper, reinitialization-free nonvolatile computer systems are designed and evaluated for energy-harvesting Internet of things (IoT) applications. In energy-harvesting applications, as power supplies generated from renewable power sources cause frequent power failures, data processed need to be backed up when power failures occur. Unless data are safely backed up before power supplies diminish, reinitialization processes are required when power supplies are recovered, which results in low energy efficiencies and slow operations. Using nonvolatile devices in processors and memories can realize a faster backup than a conventional volatile computer system, leading to a higher energy efficiency. To evaluate the energy efficiency upon frequent power failures, typical computer systems including processors and memories are designed using 90 nm CMOS or CMOS/magnetic tunnel junction (MTJ) technologies. Nonvolatile ARM Cortex-M0 processors with 4 kB MRAMs are evaluated using a typical computing benchmark program, Dhrystone, which shows a few order-of-magnitude reductions in energy in comparison with a volatile processor with SRAM.
Reverse time migration by Krylov subspace reduced order modeling
NASA Astrophysics Data System (ADS)
Basir, Hadi Mahdavi; Javaherian, Abdolrahim; Shomali, Zaher Hossein; Firouz-Abadi, Roohollah Dehghani; Gholamy, Shaban Ali
2018-04-01
Imaging is a key step in seismic data processing. To date, a myriad of advanced pre-stack depth migration approaches have been developed; however, reverse time migration (RTM) is still considered the high-end imaging algorithm. The main limitations associated with the performance cost of reverse time migration are the intensive computation of the forward and backward simulations, the time consumption, and the memory allocation related to the imaging condition. Based on reduced order modeling, we propose an algorithm that addresses all the aforementioned factors. Our proposed method uses the Krylov subspace method to compute certain mode shapes of the velocity model, which serve as an orthogonal basis for the reduced order model. Reverse time migration by reduced order modeling lends itself to highly parallel computation and strongly reduces the memory requirement of reverse time migration. The synthetic model results showed that the suggested method can decrease the computational cost of reverse time migration by several orders of magnitude compared with reverse time migration by the finite element method.
Tabe-Bordbar, Shayan; Marashi, Sayed-Amir
2013-12-01
Elementary modes (EMs) are steady-state metabolic flux vectors with minimal set of active reactions. Each EM corresponds to a metabolic pathway. Therefore, studying EMs is helpful for analyzing the production of biotechnologically important metabolites. However, memory requirements for computing EMs may hamper their applicability as, in most genome-scale metabolic models, no EM can be computed due to running out of memory. In this study, we present a method for computing randomly sampled EMs. In this approach, a network reduction algorithm is used for EM computation, which is based on flux balance-based methods. We show that this approach can be used to recover the EMs in the medium- and genome-scale metabolic network models, while the EMs are sampled in an unbiased way. The applicability of such results is shown by computing “estimated” control-effective flux values in Escherichia coli metabolic network.
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing
NASA Astrophysics Data System (ADS)
Johl, John T.; Baker, Nick C.
1988-10-01
The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
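The record above lists 2-D spatial convolution among the vectorized kernels; as a plain NumPy reference (not the MDAC vector-processor implementation), the sketch below computes each output pixel as an independent weighted sum over a small neighbourhood, which is the data-parallel structure that maps well onto vector hardware. The kernel is applied without flipping, i.e. the filtering convention.

```python
# Sketch: straightforward 2-D spatial convolution (filtering convention, edge padding).
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(image, dtype=float)
    for dy in range(kh):                 # loop over the small kernel only;
        for dx in range(kw):             # per-pixel work is fully data-parallel
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out

if __name__ == "__main__":
    img = np.arange(25, dtype=float).reshape(5, 5)
    blur = np.full((3, 3), 1.0 / 9.0)
    print(conv2d(img, blur))
```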
Donato, David I.
2012-01-01
This report presents the mathematical expressions and the computational techniques required to compute maximum-likelihood estimates for the parameters of the National Descriptive Model of Mercury in Fish (NDMMF), a statistical model used to predict the concentration of methylmercury in fish tissue. The expressions and techniques reported here were prepared to support the development of custom software capable of computing NDMMF parameter estimates more quickly and using less computer memory than is currently possible with available general-purpose statistical software. Computation of maximum-likelihood estimates for the NDMMF by numerical solution of a system of simultaneous equations through repeated Newton-Raphson iterations is described. This report explains the derivation of the mathematical expressions required for computational parameter estimation in sufficient detail to facilitate future derivations for any revised versions of the NDMMF that may be developed.
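Because the NDMMF equations themselves are not reproduced in this summary, the sketch below illustrates only the computational pattern it describes: repeated Newton-Raphson updates that drive the likelihood's gradient to zero, with ordinary logistic regression standing in for the actual model.

```python
# Sketch: maximum-likelihood estimation by Newton-Raphson iteration (logistic model).
import numpy as np

def newton_mle_logistic(X, y, tol=1e-10, max_iter=50):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                      # score vector (gradient of log-likelihood)
        H = -(X * (p * (1 - p))[:, None]).T @ X   # Hessian of the log-likelihood
        step = np.linalg.solve(H, grad)
        beta = beta - step                        # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(500), rng.normal(size=500)])
    true_beta = np.array([-0.5, 1.5])
    y = (rng.random(500) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)
    print(np.round(newton_mle_logistic(X, y), 2))   # close to the true coefficients
```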
NASA Astrophysics Data System (ADS)
Okamoto, Taro; Takenaka, Hiroshi; Nakamura, Takeshi; Aoki, Takayuki
2010-12-01
We adopted the GPU (graphics processing unit) to accelerate the large-scale finite-difference simulation of seismic wave propagation. The simulation can benefit from the high-memory bandwidth of GPU because it is a "memory intensive" problem. In a single-GPU case we achieved a performance of about 56 GFlops, which was about 45-fold faster than that achieved by a single core of the host central processing unit (CPU). We confirmed that the optimized use of fast shared memory and registers were essential for performance. In the multi-GPU case with three-dimensional domain decomposition, the non-contiguous memory alignment in the ghost zones was found to impose quite long time in data transfer between GPU and the host node. This problem was solved by using contiguous memory buffers for ghost zones. We achieved a performance of about 2.2 TFlops by using 120 GPUs and 330 GB of total memory: nearly (or more than) 2200 cores of host CPUs would be required to achieve the same performance. The weak scaling was nearly proportional to the number of GPUs. We therefore conclude that GPU computing for large-scale simulation of seismic wave propagation is a promising approach as a faster simulation is possible with reduced computational resources compared to CPUs.
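A framework-agnostic sketch of the ghost-zone packing described above: faces that are strided in memory (here, the faces perpendicular to the fastest-varying axis of a C-ordered array) are copied into contiguous buffers before a single bulk transfer and unpacked into the neighbour's ghost layer afterwards. NumPy copies stand in for the CUDA/MPI transfers of the actual code.

```python
# Sketch: pack strided halo faces into contiguous buffers before bulk transfer.
import numpy as np

def pack_last_axis_faces(u, ghost=1):
    """Contiguous copies of the interior faces adjacent to the ghost layers."""
    lo = np.ascontiguousarray(u[:, :, ghost:2 * ghost])
    hi = np.ascontiguousarray(u[:, :, -2 * ghost:-ghost])
    return lo, hi

def unpack_last_axis_faces(u, from_lo, from_hi, ghost=1):
    u[:, :, :ghost] = from_lo          # fill the low-side ghost layer
    u[:, :, -ghost:] = from_hi         # fill the high-side ghost layer

if __name__ == "__main__":
    a = np.zeros((8, 6, 6))
    b = np.ones((8, 6, 6))
    lo_b, hi_b = pack_last_axis_faces(b)
    unpack_last_axis_faces(a, from_lo=hi_b, from_hi=lo_b)   # periodic pair exchange
    print(a[0, 0, 0], a[0, 0, -1])     # ghost cells now hold neighbour data: 1.0 1.0
```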
Event parallelism: Distributed memory parallel computing for high energy physics experiments
NASA Astrophysics Data System (ADS)
Nash, Thomas
1989-12-01
This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point-to-point, rather than bussed, communication will be required. Developments in this direction are described.
Application of a simple cerebellar model to geologic surface mapping
Hagens, A.; Doveton, J.H.
1991-01-01
Neurophysiological research into the structure and function of the cerebellum has inspired computational models that simulate information processing associated with coordination and motor movement. The cerebellar model arithmetic computer (CMAC) has a design structure which makes it readily applicable as an automated mapping device that "senses" a surface, based on a sample of discrete observations of surface elevation. The model operates as an iterative learning process, where cell weights are continuously modified by feedback to improve surface representation. The storage requirements are substantially less than those of a conventional memory allocation, and the model is extended easily to mapping in multidimensional space, where the memory savings are even greater. © 1991.
Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures
2017-10-04
Report: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures (Chapel Hill). Report Term: 0-Other. ...algorithms for scientific and geometric computing by exploiting the power and performance efficiency of heterogeneous shared memory architectures...
Internode data communications in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Miller, Douglas R.; Parker, Jeffrey J.; Ratterman, Joseph D.; Smith, Brian E.
2013-09-03
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
Internode data communications in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E
2014-02-11
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
NASA Astrophysics Data System (ADS)
Sasano, Koji; Okajima, Hiroshi; Matsunaga, Nobutomo
Recently, fractional order PID (FO-PID) control, an extension of PID control, has attracted attention. Although FO-PID requires a high-order filter, such a filter is difficult to realize because of the memory limitations of digital computers. Implementing FO-PID therefore requires approximating the fractional integrator and differentiator. The short memory principle (SMP) is one of the effective approximation methods; however, a filter approximated with SMP cannot eliminate the steady-state error. To address this problem, we introduce a distributed implementation of the integrator and a dynamic quantizer to make efficient use of the permissible memory. The objective of this study is to clarify how to implement an accurate FO-PID controller with limited memory. In this paper, we propose an implementation method for FO-PID under a memory constraint using a dynamic quantizer, and we examine the trade-off between the approximation of the fractional elements and the quantized data size so that the response is as close as possible to the ideal FO-PID response. The effectiveness of the proposed method is evaluated by a numerical example and by an experiment on the temperature control of a heat plate.
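A hedged sketch of the short memory principle approximation that the paper starts from: the Grünwald-Letnikov operator of order α (α < 0 gives a fractional integrator) is evaluated using only the most recent L samples. Running it on a unit step shows the truncation-induced error that motivates the paper; the parameter values and the test signal are illustrative only.

```python
# Sketch: Gruenwald-Letnikov fractional operator with short-memory-principle truncation.
import numpy as np

def gl_weights(alpha, L):
    w = np.empty(L + 1)
    w[0] = 1.0
    for j in range(1, L + 1):
        w[j] = w[j - 1] * (1.0 - (alpha + 1.0) / j)   # binomial-coefficient recursion
    return w

def fractional_operator(x, alpha, h, memory_seconds=None):
    """Apply D^alpha to the sampled signal x (step h), optionally truncating the history."""
    n = len(x)
    L = n - 1 if memory_seconds is None else min(n - 1, int(memory_seconds / h))
    w = gl_weights(alpha, L)
    y = np.zeros(n)
    for k in range(n):
        m = min(k, L)
        y[k] = h ** (-alpha) * np.dot(w[:m + 1], x[k::-1][:m + 1])
    return y

if __name__ == "__main__":
    h = 0.01
    t = np.arange(0.0, 10.0, h)
    u = np.ones_like(t)                               # unit step input
    full = fractional_operator(u, alpha=-0.5, h=h)    # half-order integral, full memory
    smp = fractional_operator(u, alpha=-0.5, h=h, memory_seconds=1.0)
    print("t=10: full memory =", round(full[-1], 3), " SMP(1 s) =", round(smp[-1], 3))
```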
Ehsan, Shoaib; Clark, Adrian F.; ur Rehman, Naveed; McDonald-Maier, Klaus D.
2015-01-01
The integral image, an intermediate image representation, has found extensive use in multi-scale local feature detection algorithms, such as Speeded-Up Robust Features (SURF), allowing fast computation of rectangular features at constant speed, independent of filter size. For resource-constrained real-time embedded vision systems, computation and storage of integral image presents several design challenges due to strict timing and hardware limitations. Although calculation of the integral image only consists of simple addition operations, the total number of operations is large owing to the generally large size of image data. Recursive equations allow substantial decrease in the number of operations but require calculation in a serial fashion. This paper presents two new hardware algorithms that are based on the decomposition of these recursive equations, allowing calculation of up to four integral image values in a row-parallel way without significantly increasing the number of operations. An efficient design strategy is also proposed for a parallel integral image computation unit to reduce the size of the required internal memory (nearly 35% for common HD video). Addressing the storage problem of integral image in embedded vision systems, the paper presents two algorithms which allow substantial decrease (at least 44.44%) in the memory requirements. Finally, the paper provides a case study that highlights the utility of the proposed architectures in embedded vision systems. PMID:26184211
Ehsan, Shoaib; Clark, Adrian F; Naveed ur Rehman; McDonald-Maier, Klaus D
2015-07-10
The integral image, an intermediate image representation, has found extensive use in multi-scale local feature detection algorithms, such as Speeded-Up Robust Features (SURF), allowing fast computation of rectangular features at constant speed, independent of filter size. For resource-constrained real-time embedded vision systems, computation and storage of integral image presents several design challenges due to strict timing and hardware limitations. Although calculation of the integral image only consists of simple addition operations, the total number of operations is large owing to the generally large size of image data. Recursive equations allow substantial decrease in the number of operations but require calculation in a serial fashion. This paper presents two new hardware algorithms that are based on the decomposition of these recursive equations, allowing calculation of up to four integral image values in a row-parallel way without significantly increasing the number of operations. An efficient design strategy is also proposed for a parallel integral image computation unit to reduce the size of the required internal memory (nearly 35% for common HD video). Addressing the storage problem of integral image in embedded vision systems, the paper presents two algorithms which allow substantial decrease (at least 44.44%) in the memory requirements. Finally, the paper provides a case study that highlights the utility of the proposed architectures in embedded vision systems.
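For reference, the two standard facts these records build on can be written in a few lines: the integral-image recursion s(x,y) = i(x,y) + s(x−1,y) + s(x,y−1) − s(x−1,y−1) and the constant-time rectangular sum from four lookups. The scalar baseline below is not the row-parallel hardware decomposition proposed in the papers.

```python
# Sketch: integral image construction and constant-time box sums.
import numpy as np

def integral_image(img):
    h, w = img.shape
    ii = np.zeros((h + 1, w + 1), dtype=np.int64)     # zero-padded border simplifies lookups
    for y in range(1, h + 1):
        for x in range(1, w + 1):
            # s(x,y) = i(x,y) + s(x-1,y) + s(x,y-1) - s(x-1,y-1)
            ii[y, x] = img[y - 1, x - 1] + ii[y - 1, x] + ii[y, x - 1] - ii[y - 1, x - 1]
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from four integral-image lookups."""
    return ii[bottom + 1, right + 1] - ii[top, right + 1] - ii[bottom + 1, left] + ii[top, left]

if __name__ == "__main__":
    img = np.arange(1, 17).reshape(4, 4)
    ii = integral_image(img)
    assert box_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
    print(box_sum(ii, 0, 0, 3, 3), img.sum())
```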
Smart photodetector arrays for error control in page-oriented optical memory
NASA Astrophysics Data System (ADS)
Schaffer, Maureen Elizabeth
1998-12-01
Page-oriented optical memories (POMs) have been proposed to meet high speed, high capacity storage requirements for input/output intensive computer applications. This technology offers the capability for storage and retrieval of optical data in two-dimensional pages resulting in high throughput data rates. Since currently measured raw bit error rates for these systems fall several orders of magnitude short of industry requirements for binary data storage, powerful error control codes must be adopted. These codes must be designed to take advantage of the two-dimensional memory output. In addition, POMs require an optoelectronic interface to transfer the optical data pages to one or more electronic host systems. Conventional charge coupled device (CCD) arrays can receive optical data in parallel, but the relatively slow serial electronic output of these devices creates a system bottleneck thereby eliminating the POM advantage of high transfer rates. Also, CCD arrays are "unintelligent" interfaces in that they offer little data processing capabilities. The optical data page can be received by two-dimensional arrays of "smart" photo-detector elements that replace conventional CCD arrays. These smart photodetector arrays (SPAs) can perform fast parallel data decoding and error control, thereby providing an efficient optoelectronic interface between the memory and the electronic computer. This approach optimizes the computer memory system by combining the massive parallelism and high speed of optics with the diverse functionality, low cost, and local interconnection efficiency of electronics. In this dissertation we examine the design of smart photodetector arrays for use as the optoelectronic interface for page-oriented optical memory. We review options and technologies for SPA fabrication, develop SPA requirements, and determine SPA scalability constraints with respect to pixel complexity, electrical power dissipation, and optical power limits. Next, we examine data modulation and error correction coding for the purpose of error control in the POM system. These techniques are adapted, where possible, for 2D data and evaluated as to their suitability for a SPA implementation in terms of BER, code rate, decoder time and pixel complexity. Our analysis shows that differential data modulation combined with relatively simple block codes known as array codes provide a powerful means to achieve the desired data transfer rates while reducing error rates to industry requirements. Finally, we demonstrate the first smart photodetector array designed to perform parallel error correction on an entire page of data and satisfy the sustained data rates of page-oriented optical memories. Our implementation integrates a monolithic PN photodiode array and differential input receiver for optoelectronic signal conversion with a cluster error correction code using 0.35-μm CMOS. This approach provides high sensitivity, low electrical power dissipation, and fast parallel correction of 2 x 2-bit cluster errors in an 8 x 8 bit code block to achieve corrected output data rates scalable to 102 Gbps in the current technology increasing to 1.88 Tbps in 0.1-μm CMOS.
Vector computer memory bank contention
NASA Technical Reports Server (NTRS)
Bailey, D. H.
1985-01-01
A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
Vector computer memory bank contention
NASA Technical Reports Server (NTRS)
Bailey, David H.
1987-01-01
A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
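Bailey's Markov chain model and Monte Carlo program are not reproduced in the abstract; the toy Python simulation below, in which every parameter (16 banks, an 8-tick reservation time, the chosen access patterns) is an assumption made for illustration, shows how long bank reservation times turn into stalls for strided vector accesses.

```python
import random

def simulate_bank_contention(n_banks=16, reservation=8, stride=1,
                             n_accesses=100_000, randomize=False, seed=0):
    """Toy simulation of memory bank contention: one vector-element access is
    issued per CPU tick to bank (i * stride) mod n_banks (or to a random bank
    when randomize=True); if that bank is still reserved from an earlier
    access, the pipeline stalls until it frees up.  Returns the average
    slowdown relative to a contention-free machine."""
    rng = random.Random(seed)
    free_at = [0] * n_banks                   # tick at which each bank frees up
    tick = 0
    for i in range(n_accesses):
        bank = rng.randrange(n_banks) if randomize else (i * stride) % n_banks
        tick = max(tick + 1, free_at[bank])   # stall while the bank is busy
        free_at[bank] = tick + reservation    # reserve the bank
    return tick / n_accesses

print(simulate_bank_contention(stride=1))        # ~1.0: accesses spread over banks
print(simulate_bank_contention(stride=16))       # ~8.0: every access hits one bank
print(simulate_bank_contention(randomize=True))  # between the two extremes
```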
NASA Technical Reports Server (NTRS)
Gerber, C. R.
1972-01-01
The computation and logical functions which are performed by the data processing assembly of the modular space station are defined. The subjects discussed are: (1) requirements analysis, (2) baseline data processing assembly configuration, (3) information flow study, (4) throughput simulation, (5) redundancy study, (6) memory studies, and (7) design requirements specification.
Enhancing an appointment diary on a pocket computer for use by people after brain injury.
Wright, P; Rogers, N; Hall, C; Wilson, B; Evans, J; Emslie, H
2001-12-01
People with memory loss resulting from brain injury benefit from purpose-designed memory aids such as appointment diaries on pocket computers. The present study explores the effects of extending the range of memory aids and including games. For 2 months, 12 people who had sustained brain injury were loaned a pocket computer containing three purpose-designed memory aids: diary, notebook and to-do list. A month later they were given another computer with the same memory aids but a different method of text entry (physical keyboard or touch-screen keyboard). Machine order was counterbalanced across participants. Assessment was by interviews during the loan periods, rating scales, performance tests and computer log files. All participants could use the memory aids and ten people (83%) found them very useful. Correlations among the three memory aids were not significant, suggesting individual variation in how they were used. Games did not increase use of the memory aids, nor did loan of the preferred pocket computer (with physical keyboard). Significantly more diary entries were made by people who had previously used other memory aids, suggesting that a better understanding of how to use a range of memory aids could benefit some people with brain injury.
Parallel computing for probabilistic fatigue analysis
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.
1993-01-01
This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism, to keep a large number of processors busy, and to treat problems with the large memory requirements encountered in practice. We also conclude that a distributed-memory architecture is preferable to shared memory for achieving large-scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.
Practical Unitary Simulator for Non-Markovian Complex Processes
NASA Astrophysics Data System (ADS)
Binder, Felix C.; Thompson, Jayne; Gu, Mile
2018-06-01
Stochastic processes are as ubiquitous throughout the quantitative sciences as they are notorious for being difficult to simulate and predict. In this Letter, we propose a unitary quantum simulator for discrete-time stochastic processes which requires less internal memory than any classical analogue throughout the simulation. The simulator's internal memory requirements equal those of the best previous quantum models. However, in contrast to previous models, it only requires a (small) finite-dimensional Hilbert space. Moreover, since the simulator operates unitarily throughout, it avoids any unnecessary information loss. We provide a stepwise construction for simulators for a large class of stochastic processes hence directly opening the possibility for experimental implementations with current platforms for quantum computation. The results are illustrated for an example process.
High Performance Computing (HPC)-Enabled Computational Study on the Feasibility of using Shape Memory Alloys for Gas Turbine Blade Actuation
Esham, Kathryn; Bravo, Luis; Ghoshal, Anindya; Murugan, Muthuvel; Michael ...
2016-11-01
Scaling to Nanotechnology Limits with the PIMS Computer Architecture and a new Scaling Rule
DOE Office of Scientific and Technical Information (OSTI.GOV)
Debenedictis, Erik P.
2015-02-01
We describe a new approach to computing that moves towards the limits of nanotechnology using a newly formulated scaling rule. This is in contrast to the current computer industry scaling away from von Neumann's original computer at the rate of Moore's Law. We extend Moore's Law to 3D, which leads generally to architectures that integrate logic and memory. To keep power dissipation constant through a 2D surface of the 3D structure requires using adiabatic principles. We call our newly proposed architecture Processor In Memory and Storage (PIMS). We propose a new computational model that integrates processing and memory into "tiles" that comprise logic, memory/storage, and communications functions. Since the programming model will be relatively stable as a system scales, programs represented by tiles could be executed in a PIMS system built with today's technology or could become the "schematic diagram" for implementation in an ultimate 3D nanotechnology of the future. We build a systems software approach that offers advantages over and above the technological and architectural advantages. First, the algorithms may be more efficient in the conventional sense of having fewer steps. Second, the algorithms may run with higher power efficiency per operation by being a better match for the adiabatic scaling rule. The performance analysis based on demonstrated ideas in physical science suggests 80,000x improvement in cost per operation for the (arguably) general purpose function of emulating neurons in Deep Learning.
NASA Astrophysics Data System (ADS)
Pavlichin, Dmitri S.; Mabuchi, Hideo
2014-06-01
Nanoscale integrated photonic devices and circuits offer a path to ultra-low power computation at the few-photon level. Here we propose an optical circuit that performs a ubiquitous operation: the controlled, random-access readout of a collection of stored memory phases or, equivalently, the computation of the inner product of a vector of phases with a binary "selector" vector, where the arithmetic is done modulo 2π and the result is encoded in the phase of a coherent field. This circuit, a collection of cascaded interferometers driven by a coherent input field, demonstrates the use of coherence as a computational resource, and the use of recently-developed mathematical tools for modeling optical circuits with many coupled parts. The construction extends in a straightforward way to the computation of matrix-vector and matrix-matrix products, and, with the inclusion of an optical feedback loop, to the computation of a "weighted" readout of stored memory phases. We note some applications of these circuits for error correction and for computing tasks requiring fast vector inner products, e.g. statistical classification and some machine learning algorithms.
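The quantity the proposed circuit encodes in its output phase can be stated in a few lines; the sketch below is a plain numerical illustration with made-up phases, not a model of the photonic hardware, and simply computes the inner product of stored phases with a binary selector vector modulo 2π.

```python
import numpy as np

def phase_inner_product(phases, selector):
    """Inner product of a vector of stored phases with a binary selector
    vector, with arithmetic modulo 2*pi -- the quantity the cascaded
    interferometer circuit encodes in the phase of its output field."""
    phases = np.asarray(phases, dtype=float)
    selector = np.asarray(selector, dtype=int)
    return np.dot(selector, phases) % (2 * np.pi)

stored = [0.3, 1.1, np.pi, 5.9]    # memory phases (radians), illustrative only
select = [1, 0, 1, 1]              # binary "selector" vector
print(phase_inner_product(stored, select))   # (0.3 + pi + 5.9) mod 2*pi
```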
NASA Astrophysics Data System (ADS)
Ando, K.; Fujita, S.; Ito, J.; Yuasa, S.; Suzuki, Y.; Nakatani, Y.; Miyazaki, T.; Yoda, H.
2014-05-01
Most parts of present computer systems are made of volatile devices, and the power that must be supplied to them to avoid information loss causes huge energy losses. We can eliminate this meaningless energy loss by utilizing the non-volatile function of advanced spin-transfer torque magnetoresistive random-access memory (STT-MRAM) technology and create a new type of computer, i.e., normally off computers. Critical tasks to achieve normally off computers are implementations of STT-MRAM technologies in the main memory and low-level cache memories. STT-MRAM technology for applications to the main memory has been successfully developed by using perpendicular STT-MRAMs, and faster STT-MRAM technologies for applications to the cache memory are now being developed. The present status of STT-MRAMs and challenges that remain for normally off computers are discussed.
An Investigation of Unified Memory Access Performance in CUDA
Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin
2015-01-01
Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations. PMID:26594668
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.
Bicer, Tekin; Gürsoy, Doğa; Andrade, Vincent De; Kettimuthu, Rajkumar; Scullin, William; Carlo, Francesco De; Foster, Ian T
2017-01-01
Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
Simulation of n-qubit quantum systems. I. Quantum registers and quantum gates
NASA Astrophysics Data System (ADS)
Radtke, T.; Fritzsche, S.
2005-12-01
During recent years, quantum computations and the study of n-qubit quantum systems have attracted a lot of interest, both in theory and experiment. Apart from the promise of performing quantum computations, however, these investigations also revealed a great many difficulties which still need to be solved in practice. In quantum computing, unitary and non-unitary quantum operations act on a given set of qubits to form (entangled) states, in which the information is encoded by the overall system often referred to as quantum registers. To facilitate the simulation of such n-qubit quantum systems, we present the FEYNMAN program to provide all necessary tools in order to define and to deal with quantum registers and quantum operations. Although the present version of the program is restricted to unitary transformations, it equally supports—whenever possible—the representation of the quantum registers both in terms of their state vectors and density matrices. In addition to the composition of two or more quantum registers, moreover, the program also supports their decomposition into various parts by applying the partial trace operation and the concept of the reduced density matrix. Using an interactive design within the framework of MAPLE, therefore, we expect the FEYNMAN program to be helpful not only for teaching the basic elements of quantum computing but also for studying their physical realization in the future. Program summary: Title of program: FEYNMAN. Catalogue number: ADWE. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADWE. Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland. Licensing provisions: None. Computers for which the program is designed: All computers with a license of the computer algebra system MAPLE [Maple is a registered trademark of Waterloo Maple Inc.]. Operating systems or monitors under which the program has been tested: Linux, MS Windows XP. Programming language used: MAPLE 9.5 (but should be compatible with 9.0 and 8.0, too). Memory and time required to execute with typical data: Storage and time requirements critically depend on the number of qubits, n, in the quantum registers due to the exponential increase of the associated Hilbert space. In particular, complex algebraic operations may require large amounts of memory even for small qubit numbers. However, most of the standard commands (see Section 4 for simple examples) react promptly for up to five qubits on a normal single-processor machine (⩾1 GHz with 512 MB memory) and use less than 10 MB memory. No. of lines in distributed program, including test data, etc.: 8864. No. of bytes in distributed program, including test data, etc.: 493 182. Distribution format: tar.gz. Nature of the physical problem: During the last decade, quantum computing has been found to provide a revolutionary new form of computation. The algorithms by Shor [P.W. Shor, SIAM J. Sci. Statist. Comput. 26 (1997) 1484] and Grover [L.K. Grover, Phys. Rev. Lett. 79 (1997) 325. [2
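FEYNMAN itself is a MAPLE package; as a language-neutral illustration of the underlying representation, the Python sketch below (with an assumed qubit-ordering convention and a toy Hadamard gate) applies a single-qubit unitary to an n-qubit state vector and makes the 2^n memory growth behind the quoted storage requirements explicit.

```python
import numpy as np

def apply_single_qubit_gate(state, gate, target, n):
    """Apply a 2x2 unitary `gate` to qubit `target` of an n-qubit state vector
    (length 2**n) by forming the Kronecker product of 2x2 identities with
    `gate` placed at position `target`.  Qubit 0 is taken as the most
    significant bit.  Memory grows as 2**n, so register size dominates cost."""
    op = np.array([[1]], dtype=complex)
    for q in range(n):
        op = np.kron(op, gate if q == target else np.eye(2))
    return op @ state

n = 3
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0                                   # |000>
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
state = apply_single_qubit_gate(state, H, target=0, n=n)
print(np.round(state, 3))                        # (|000> + |100>)/sqrt(2)
```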
Hardware architecture design of a fast global motion estimation method
NASA Astrophysics Data System (ADS)
Liang, Chaobing; Sang, Hongshi; Shen, Xubang
2015-12-01
VLSI implementation of gradient-based global motion estimation (GME) faces two main challenges: irregular data access and a high off-chip memory bandwidth requirement. We previously proposed a fast GME method that reduces computational complexity by choosing a certain number of small patches containing corners and using them in a gradient-based framework. A hardware architecture is designed to implement this method and further reduce the off-chip memory bandwidth requirement. On-chip memories are used to store coordinates of the corners and template patches, while the Gaussian pyramids of both the template and reference frame are stored in off-chip SDRAMs. By performing the geometric transform only on the coordinates of the center pixel of a 3-by-3 patch in the template image, a 5-by-5 area containing the warped 3-by-3 patch in the reference image is extracted from the SDRAMs by burst read. Patch-based and burst-mode data access helps to keep the off-chip memory bandwidth requirement at the minimum. Although patch size varies at different pyramid levels, all patches are processed in terms of 3x3 patches, so the utilization of the patch-processing circuit reaches 100%. FPGA implementation results show that the design utilizes 24,080 bits of on-chip memory and, for a sequence with a resolution of 352x288 and a frequency of 60 Hz, the off-chip bandwidth requirement is only 3.96 Mbyte/s, compared with 243.84 Mbyte/s for the original gradient-based GME method. This design can be used in applications like video codecs, video stabilization, and super-resolution, where real-time GME is a necessity and a minimum memory bandwidth requirement is appreciated.
Pose Invariant Face Recognition Based on Hybrid Dominant Frequency Features
NASA Astrophysics Data System (ADS)
Wijaya, I. Gede Pasek Suta; Uchimura, Keiichi; Hu, Zhencheng
Face recognition is one of the most active research areas in pattern recognition, not only because the face is a biometric characteristic of human beings but also because there are many potential applications of face recognition, which range from human-computer interaction to authentication, security, and surveillance. This paper presents an approach to pose-invariant human face image recognition. The proposed scheme is based on the analysis of discrete cosine transforms (DCT) and discrete wavelet transforms (DWT) of face images. From both the DCT and DWT domain coefficients, which describe the facial information, we build a compact and meaningful feature vector, using simple statistical measures and quantization. This feature vector is called the hybrid dominant frequency features. Then, we apply a combination of the L2 and Lq metrics to classify the hybrid dominant frequency features into a person's class. The aim of the proposed system is to overcome the high memory space requirement, the high computational load, and the retraining problems of previous methods. The proposed system is tested using several face databases and the experimental results are compared to the well-known Eigenface method. The proposed method shows good performance, robustness, stability, and accuracy without requiring geometrical normalization. Furthermore, the proposed method has low computational cost, requires little memory space, and can overcome the retraining problem.
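The hybrid dominant frequency features combine DCT and DWT coefficients through simple statistics and quantization; the sketch below illustrates only the DCT half of such a compact descriptor, and its block size, coefficient count and quantization step are illustrative assumptions rather than the authors' choices.

```python
import numpy as np
from scipy.fftpack import dct

def dct_dominant_features(face, block=8, n_coeffs=10, q_step=4.0):
    """Compact frequency descriptor: take the 2-D DCT of the face image, keep
    the low-frequency (dominant) top-left block of coefficients, and summarize
    it with simple statistics plus coarse quantization."""
    face = np.asarray(face, dtype=float)
    coeffs = dct(dct(face, axis=0, norm='ortho'), axis=1, norm='ortho')
    low = coeffs[:block, :block].ravel()[:n_coeffs]   # dominant low frequencies
    stats = np.array([low.mean(), low.std(), np.abs(low).max()])
    return np.concatenate([np.round(low / q_step), stats])

face = np.random.default_rng(0).random((64, 64))      # stand-in face image
print(dct_dominant_features(face).shape)              # (13,)
```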
Application of supercomputers to computational aerodynamics
NASA Technical Reports Server (NTRS)
Peterson, V. L.
1984-01-01
Computers are playing an increasingly important role in the field of aerodynamics such that they now serve as a major complement to wind tunnels in aerospace research and development. Factors pacing advances in computational aerodynamics are identified, including the amount of computational power required to take the next major step in the discipline. Example results obtained from the successively refined forms of the governing equations are discussed, both in the context of levels of computer power required and the degree to which they either further the frontiers of research or apply to problems of practical importance. Finally, the Numerical Aerodynamic Simulation (NAS) Program - with its 1988 target of achieving a sustained computational rate of 1 billion floating point operations per second and operating with a memory of 240 million words - is discussed in terms of its goals and its projected effect on the future of computational aerodynamics.
ERIC Educational Resources Information Center
Paesler, M. A.
2009-01-01
Digital computers use different kinds of memory, each of which is either volatile or nonvolatile. On most computers only the hard drive memory is nonvolatile, i.e., it retains all information stored on it when the power is off. When a computer is turned on, an operating system stored on the hard drive is loaded into the computer's memory cache and…
Parallelization of Program to Optimize Simulated Trajectories (POST3D)
NASA Technical Reports Server (NTRS)
Hammond, Dana P.; Korte, John J. (Technical Monitor)
2001-01-01
This paper describes the parallelization of the Program to Optimize Simulated Trajectories (POST3D). POST3D uses a gradient-based optimization algorithm that reaches an optimum design point by moving from one design point to the next. The gradient calculations required to complete the optimization process dominate the computational time and have been parallelized using a Single Program Multiple Data (SPMD) approach on a distributed-memory NUMA (non-uniform memory access) architecture. The Origin2000 was used for the tests presented.
Interactive physically-based sound simulation
NASA Astrophysics Data System (ADS)
Raghuvanshi, Nikunj
The realization of interactive, immersive virtual worlds requires the ability to present a realistic audio experience that convincingly complements their visual rendering. Physical simulation is a natural way to achieve such realism, enabling deeply immersive virtual worlds. However, physically-based sound simulation is very computationally expensive owing to the high-frequency, transient oscillations underlying audible sounds. The increasing computational power of desktop computers has served to reduce the gap between required and available computation, and it has become possible to bridge this gap further by using a combination of algorithmic improvements that exploit the physical, as well as perceptual properties of audible sounds. My thesis is a step in this direction. My dissertation concentrates on developing real-time techniques for both sub-problems of sound simulation: synthesis and propagation. Sound synthesis is concerned with generating the sounds produced by objects due to elastic surface vibrations upon interaction with the environment, such as collisions. I present novel techniques that exploit human auditory perception to simulate scenes with hundreds of sounding objects undergoing impact and rolling in real time. Sound propagation is the complementary problem of modeling the high-order scattering and diffraction of sound in an environment as it travels from source to listener. I discuss my work on a novel numerical acoustic simulator (ARD) that is a hundred times faster and consumes ten times less memory than a high-accuracy finite-difference technique, allowing acoustic simulations on previously-intractable spaces, such as a cathedral, on a desktop computer. Lastly, I present my work on interactive sound propagation that leverages my ARD simulator to render the acoustics of arbitrary static scenes for multiple moving sources and listener in real time, while accounting for scene-dependent effects such as low-pass filtering and smooth attenuation behind obstructions, reverberation, scattering from complex geometry and sound focusing. This is enabled by a novel compact representation that takes a thousand times less memory than a direct scheme, thus reducing memory footprints to fit within available main memory. To the best of my knowledge, this is the only technique and system in existence to demonstrate auralization of physical wave-based effects in real-time on large, complex 3D scenes.
NASA Astrophysics Data System (ADS)
Chaillat, S.; Bonnet, M.; Semblat, J.
2007-12-01
Seismic wave propagation and amplification in complex media is a major issue in the field of seismology. To compute seismic wave propagation in complex geological structures such as in alluvial basins, various numerical methods have been proposed. The main advantage of the Boundary Element Method (BEM) is that only the domain boundaries (and possibly interfaces) are discretized, leading to a reduction of the number of degrees of freedom. The main drawback of the standard BEM is that the governing matrix is full and non-symmetric, which gives rise to high computational and memory costs. In other areas where the BEM is used (electromagnetism, acoustics), considerable speedup of solution time and decrease of memory requirements have been achieved through the development, over the last decade, of the Fast Multipole Method (FMM). The goal of the FMM is to speed up the matrix-vector product computation needed at each iteration of the GMRES iterative solver. Moreover, the governing matrix is never explicitly formed, which leads to a storage requirement well below the memory necessary for holding the complete matrix. The FMM-accelerated BEM therefore achieves substantial savings in both CPU time and memory. In this work, the FMM is extended to the 3-D frequency-domain elastodynamics and applied to the computation of seismic wave propagation in 3-D. The efficiency of the present FMM-BEM is demonstrated on seismology-oriented examples. First, the diffraction of a plane wave or a point source by a 3-D canyon is studied. The influence of the size of the meshed part of the free surface is studied, and computations are performed for non-dimensional frequencies higher than those considered in other studies (thanks to the use of the FM-BEM), with which comparisons are made whenever possible. The method is also applied to analyze the diffraction of a plane wave or a point source by a 3-D alluvial basin. A parametrical study is performed on the effect of the shape of the basin and the interaction of the wavefield with the basin edges is analyzed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Betin, A Yu; Bobrinev, V I; Verenikina, N M
A multiplex method of recording computer-synthesised one-dimensional Fourier holograms intended for holographic memory devices is proposed. The method potentially allows increasing the recording density in the previously proposed holographic memory system based on the computer synthesis and projection recording of data page holograms. (holographic memory)
NASA Astrophysics Data System (ADS)
Ohene-Kwofie, Daniel; Otoo, Ekow
2015-10-01
The ATLAS detector, operated at the Large Hadron Collider (LHC), records proton-proton collisions at CERN every 50 ns, resulting in a sustained data flow of up to PB/s. The upgraded Tile Calorimeter of the ATLAS experiment will sustain about 5 PB/s of digital throughput. These massive data rates require extremely fast data capture and processing. Although there has been a steady increase in the processing speed of CPU/GPGPU assembled for high performance computing, the rate of data input and output, even under parallel I/O, has not kept up with the general increase in computing speeds. The problem then is whether one can implement an I/O subsystem infrastructure capable of meeting the computational speeds of the advanced computing systems at the petascale and exascale level. We propose a system architecture that leverages the Partitioned Global Address Space (PGAS) model of computing to maintain an in-memory data store for the Processing Unit (PU) of the upgraded electronics of the Tile Calorimeter, which is proposed to be used as a high-throughput general-purpose co-processor to the sROD of the upgraded Tile Calorimeter. The physical memory of the PUs is aggregated into a large global logical address space using RDMA-capable interconnects such as PCI-Express to enhance data processing throughput.
Metal oxide resistive random access memory based synaptic devices for brain-inspired computing
NASA Astrophysics Data System (ADS)
Gao, Bin; Kang, Jinfeng; Zhou, Zheng; Chen, Zhe; Huang, Peng; Liu, Lifeng; Liu, Xiaoyan
2016-04-01
The traditional Boolean computing paradigm based on the von Neumann architecture is facing great challenges for future information technology applications such as big data, the Internet of Things (IoT), and wearable devices, due to the limited processing capability issues such as binary data storage and computing, non-parallel data processing, and the buses requirement between memory units and logic units. The brain-inspired neuromorphic computing paradigm is believed to be one of the promising solutions for realizing more complex functions with a lower cost. To perform such brain-inspired computing with a low cost and low power consumption, novel devices for use as electronic synapses are needed. Metal oxide resistive random access memory (ReRAM) devices have emerged as the leading candidate for electronic synapses. This paper comprehensively addresses the recent work on the design and optimization of metal oxide ReRAM-based synaptic devices. A performance enhancement methodology and optimized operation scheme to achieve analog resistive switching and low-energy training behavior are provided. A three-dimensional vertical synapse network architecture is proposed for high-density integration and low-cost fabrication. The impacts of the ReRAM synaptic device features on the performances of neuromorphic systems are also discussed on the basis of a constructed neuromorphic visual system with a pattern recognition function. Possible solutions to achieve the high recognition accuracy and efficiency of neuromorphic systems are presented.
Managing coherence via put/get windows
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2011-01-11
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2012-02-21
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Performing an allreduce operation using shared memory
Archer, Charles J [Rochester, MN; Dozsa, Gabor [Ardsley, NY; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN
2012-04-17
Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
Performing an allreduce operation using shared memory
Archer, Charles J; Dozsa, Gabor; Ratterman, Joseph D; Smith, Brian E
2014-06-10
Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
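The patented mechanism targets compute-node hardware and its own job status object; purely as an illustration of the idea, the Python sketch below (all names and parameters are assumptions) lets a pool of worker processes claim shared-memory allreduce work units from a shared counter and reduce disjoint slices of a vector summed across ranks.

```python
import multiprocessing as mp
import numpy as np

def worker(shm_in, shm_out, next_unit, n_ranks, length, chunk):
    """An 'available core' repeatedly claims the next allreduce work unit
    (a slice of the vector) from the shared counter and reduces that slice
    across all ranks' contributions."""
    data = np.frombuffer(shm_in, dtype=np.float64).reshape(n_ranks, length)
    out = np.frombuffer(shm_out, dtype=np.float64)
    while True:
        with next_unit.get_lock():          # atomically claim a work unit
            start = next_unit.value
            if start >= length:
                return                      # no work units left
            next_unit.value = start + chunk
        sl = slice(start, start + chunk)
        out[sl] = data[:, sl].sum(axis=0)   # reduce this slice over ranks

if __name__ == "__main__":
    n_ranks, length, chunk, n_cores = 4, 64, 8, 3
    shm_in = mp.RawArray('d', n_ranks * length)   # every rank's contribution
    shm_out = mp.RawArray('d', length)            # allreduce (sum) result
    np.frombuffer(shm_in, dtype=np.float64)[:] = np.arange(n_ranks * length)
    next_unit = mp.Value('i', 0)                  # stand-in "job status object"
    procs = [mp.Process(target=worker,
                        args=(shm_in, shm_out, next_unit, n_ranks, length, chunk))
             for _ in range(n_cores)]
    for p in procs: p.start()
    for p in procs: p.join()
    expected = np.arange(n_ranks * length).reshape(n_ranks, length).sum(axis=0)
    assert np.allclose(np.frombuffer(shm_out, dtype=np.float64), expected)
```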
VLSI processors for signal detection in SETI
NASA Technical Reports Server (NTRS)
Duluk, J. F.; Linscott, I. R.; Peterson, A. M.; Burr, J.; Ekroot, B.; Twicken, J.
1989-01-01
The objective of the Search for Extraterrestrial Intelligence (SETI) is to locate an artificially created signal coming from a distant star. This is done in two steps: (1) spectral analysis of an incoming radio frequency band, and (2) pattern detection for narrow-band signals. Both steps are computationally expensive and require the development of specially designed computer architectures. To reduce the size and cost of the SETI signal detection machine, two custom VLSI chips are under development. The first chip, the SETI DSP Engine, is used in the spectrum analyzer and is specially designed to compute Discrete Fourier Transforms (DFTs). It is a high-speed arithmetic processor that has two adders, one multiplier-accumulator, and three four-port memories. The second chip is a new type of Content-Addressable Memory. It is the heart of an associative processor that is used for pattern detection. Both chips incorporate many innovative circuits and architectural features.
VLSI processors for signal detection in SETI.
Duluk, J F; Linscott, I R; Peterson, A M; Burr, J; Ekroot, B; Twicken, J
1989-01-01
The objective of the Search for Extraterrestrial Intelligence (SETI) is to locate an artificially created signal coming from a distant star. This is done in two steps: (1) spectral analysis of an incoming radio frequency band, and (2) pattern detection for narrow-band signals. Both steps are computationally expensive and require the development of specially designed computer architectures. To reduce the size and cost of the SETI signal detection machine, two custom VLSI chips are under development. The first chip, the SETI DSP Engine, is used in the spectrum analyzer and is specially designed to compute Discrete Fourier Transforms (DFTs). It is a high-speed arithmetic processor that has two adders, one multiplier-accumulator, and three four-port memories. The second chip is a new type of Content-Addressable Memory. It is the heart of an associative processor that is used for pattern detection. Both chips incorporate many innovative circuits and architectural features.
Covering Resilience: A Recent Development for Binomial Checkpointing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Walther, Andrea; Narayanan, Sri Hari Krishna
In terms of computing time, adjoint methods offer a very attractive alternative to compute gradient information, required, e.g., for optimization purposes. However, together with this very favorable temporal complexity result comes a memory requirement that is in essence proportional to the operation count of the underlying function, e.g., if algorithmic differentiation is used to provide the adjoints. For this reason, checkpointing approaches in many variants have become popular. This paper analyzes an extension of the so-called binomial approach to also cover possible failures of the computing systems. Such a measure of precaution is of special interest for massively parallel simulations and adjoint calculations where the mean time between failures of the large-scale computing system is smaller than the time needed to complete the calculation of the adjoint information. We describe the extensions of standard checkpointing approaches required for such resilience, provide a corresponding implementation and discuss first numerical results.
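The binomial (revolve-style) schedule analyzed in the paper is not reproduced here; the Python sketch below illustrates only the underlying checkpoint/recompute trade-off with a simpler, evenly spaced schedule and no failure model, and the step function and parameters are assumptions made for illustration.

```python
import math

def f(x): return math.sin(x) + 0.5 * x        # one forward time step
def dfdx(x): return math.cos(x) + 0.5         # its local derivative

def adjoint_with_checkpoints(x0, n_steps, every):
    """Evenly spaced checkpointing: store only every `every`-th forward state,
    then recompute the missing states from the nearest checkpoint during the
    reverse (adjoint) sweep.  This trades extra forward recomputation for an
    O(n_steps / every) memory footprint instead of O(n_steps)."""
    checkpoints = {0: x0}                      # sparse forward snapshots
    x = x0
    for i in range(n_steps):                   # forward sweep
        x = f(x)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = x
    xbar = 1.0                                 # reverse sweep: d(x_N)/d(x_0)
    for i in range(n_steps - 1, -1, -1):
        base = (i // every) * every            # nearest stored checkpoint <= i
        xi = checkpoints[base]
        for _ in range(i - base):              # recompute x_i from that checkpoint
            xi = f(xi)
        xbar *= dfdx(xi)
    return x, xbar

xN, grad = adjoint_with_checkpoints(x0=0.7, n_steps=64, every=8)
print(xN, grad)   # final state and its gradient w.r.t. the initial state
```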
A class of hybrid finite element methods for electromagnetics: A review
NASA Technical Reports Server (NTRS)
Volakis, J. L.; Chatterjee, A.; Gong, J.
1993-01-01
Integral equation methods have generally been the workhorse for antenna and scattering computations. In the case of antennas, they continue to be the prominent computational approach, but for scattering applications the requirement for large-scale computations has turned researchers' attention to near neighbor methods such as the finite element method, which has low O(N) storage requirements and is readily adaptable in modeling complex geometrical features and material inhomogeneities. In this paper, we review three hybrid finite element methods for simulating composite scatterers, conformal microstrip antennas, and finite periodic arrays. Specifically, we discuss the finite element method and its application to electromagnetic problems when combined with the boundary integral, absorbing boundary conditions, and artificial absorbers for terminating the mesh. Particular attention is given to large-scale simulations, methods, and solvers for achieving low memory requirements and code performance on parallel computing architectures.
Realization of Quantum Digital Signatures without the Requirement of Quantum Memory
NASA Astrophysics Data System (ADS)
Collins, Robert J.; Donaldson, Ross J.; Dunjko, Vedran; Wallden, Petros; Clarke, Patrick J.; Andersson, Erika; Jeffers, John; Buller, Gerald S.
2014-07-01
Digital signatures are widely used to provide security for electronic communications, for example, in financial transactions and electronic mail. Currently used classical digital signature schemes, however, only offer security relying on unproven computational assumptions. In contrast, quantum digital signatures offer information-theoretic security based on laws of quantum mechanics. Here, security against forging relies on the impossibility of perfectly distinguishing between nonorthogonal quantum states. A serious drawback of previous quantum digital signature schemes is that they require long-term quantum memory, making them impractical at present. We present the first realization of a scheme that does not need quantum memory and which also uses only standard linear optical components and photodetectors. In our realization, the recipients measure the distributed quantum signature states using a new type of quantum measurement, quantum state elimination. This significantly advances quantum digital signatures as a quantum technology with potential for real applications.
Realization of quantum digital signatures without the requirement of quantum memory.
Collins, Robert J; Donaldson, Ross J; Dunjko, Vedran; Wallden, Petros; Clarke, Patrick J; Andersson, Erika; Jeffers, John; Buller, Gerald S
2014-07-25
Digital signatures are widely used to provide security for electronic communications, for example, in financial transactions and electronic mail. Currently used classical digital signature schemes, however, only offer security relying on unproven computational assumptions. In contrast, quantum digital signatures offer information-theoretic security based on laws of quantum mechanics. Here, security against forging relies on the impossibility of perfectly distinguishing between nonorthogonal quantum states. A serious drawback of previous quantum digital signature schemes is that they require long-term quantum memory, making them impractical at present. We present the first realization of a scheme that does not need quantum memory and which also uses only standard linear optical components and photodetectors. In our realization, the recipients measure the distributed quantum signature states using a new type of quantum measurement, quantum state elimination. This significantly advances quantum digital signatures as a quantum technology with potential for real applications.
NASA Astrophysics Data System (ADS)
Nebashi, Ryusuke; Sakimura, Noboru; Sugibayashi, Tadahiko
2017-08-01
We evaluated the soft-error tolerance and energy consumption of an embedded computer with magnetic random access memory (MRAM) using two computer simulators. One is a central processing unit (CPU) simulator of a typical embedded computer system. We simulated the radiation-induced single-event-upset (SEU) probability in a spin-transfer-torque MRAM cell and also the failure rate of a typical embedded computer due to its main memory SEU error. The other is a delay tolerant network (DTN) system simulator. It simulates the power dissipation of wireless sensor network nodes of the system using a revised CPU simulator and a network simulator. We demonstrated that the SEU effect on the embedded computer with 1 Gbit MRAM-based working memory is less than 1 failure in time (FIT). We also demonstrated that the energy consumption of the DTN sensor node with MRAM-based working memory can be reduced to 1/11. These results indicate that MRAM-based working memory enhances the disaster tolerance of embedded computers.
Multiprocessor shared-memory information exchange
DOE Office of Scientific and Technical Information (OSTI.GOV)
Santoline, L.L.; Bowers, M.D.; Crew, A.W.
1989-02-01
In distributed microprocessor-based instrumentation and control systems, the inter- and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically, the protocol allows multiple processors to exchange information via a shared-memory interface. The authors' primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, is designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.
NASA Technical Reports Server (NTRS)
Denning, P. J.
1986-01-01
Virtual memory was conceived as a way to automate overlaying of program segments. Modern computers have very large main memories, but need automatic solutions to the relocation and protection problems. Virtual memory serves this need as well and is thus useful in computers of all sizes. The history of the idea is traced, showing how it has become a widespread, little noticed feature of computers today.
Cognitive Assessment of Movement-Based Computer Games
NASA Technical Reports Server (NTRS)
Kearney, Paul
2008-01-01
This paper examines the possibility that dance games such as Dance Dance Revolution or StepMania enhance the cognitive abilities that are critical to academic achievement. These games appear to place a high cognitive load on working memory requiring the player to convert a visual signal to a physical movement up to 7 times per second. Players see a pattern of directions displayed on the screen and they memorise these as a dance sequence. Other researchers have found that attention span and memory ability, both cognitive abilities required for academic achievement, are improved through the use of physical movement and exercise. This paper reviews these claims and documents tool development for on-going research by the author.
Stochastic quasi-Newton molecular simulations
NASA Astrophysics Data System (ADS)
Chau, C. D.; Sevink, G. J. A.; Fraaije, J. G. E. M.
2010-08-01
We report a new and efficient factorized algorithm for the determination of the adaptive compound mobility matrix B in a stochastic quasi-Newton method (S-QN) that does not require additional potential evaluations. For one-dimensional and two-dimensional test systems, we previously showed that S-QN gives rise to efficient configurational space sampling with good thermodynamic consistency [C. D. Chau, G. J. A. Sevink, and J. G. E. M. Fraaije, J. Chem. Phys. 128, 244110 (2008); doi:10.1063/1.2943313]. Potential applications of S-QN are quite ambitious, and include structure optimization, analysis of correlations and automated extraction of cooperative modes. However, the potential can only be fully exploited if the computational and memory requirements of the original algorithm are significantly reduced. In this paper, we consider a factorized mobility matrix B = JJ^T and focus on the nontrivial fundamentals of an efficient algorithm for updating the noise multiplier J. The new algorithm requires O(n^2) multiplications per time step instead of the O(n^3) multiplications in the original scheme due to Choleski decomposition. In a recursive form, the update scheme circumvents matrix storage and enables limited-memory implementation, in the spirit of the well-known limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method, allowing for a further reduction of the computational effort to O(n). We analyze in detail the performance of the factorized (FSU) and limited-memory (L-FSU) algorithms in terms of convergence and (multiscale) sampling, for an elementary but relevant system that involves multiple time and length scales. Finally, we use this analysis to formulate conditions for the simulation of the complex high-dimensional potential energy landscapes of interest.
APPLICATION OF A FINITE-DIFFERENCE TECHNIQUE TO THE HUMAN RADIOFREQUENCY DOSIMETRY PROBLEM
A powerful finite difference numerical technique has been applied to the human radiofrequency dosimetry problem. The method possesses inherent advantages over the method of moments approach in that its implementation requires much less computer memory. Consequently, it has the ca...
Simple and Efficient Single Photon Filter for a Rb-based Quantum Memory
NASA Astrophysics Data System (ADS)
Stack, Daniel; Li, Xiao; Quraishi, Qudsia
2015-05-01
Distribution of entangled quantum states over significant distances is important to the development of future quantum technologies such as long-distance cryptography, networks of atomic clocks, distributed quantum computing, etc. Long-lived quantum memories and single photons are building blocks for systems capable of realizing such applications. The ability to store and retrieve quantum information while filtering unwanted light signals is critical to the operation of quantum memories based on neutral-atom ensembles. We report on an efficient frequency filter which uses a glass cell filled with 85Rb vapor to attenuate noise photons by an order of magnitude with little loss to the single photons associated with the operation of our cold 87Rb quantum memory. An Ar buffer gas is required to differentiate between signal and noise photons. Our simple, passive filter requires no optical pumping or external frequency references and provides an additional 18 dB attenuation of our pump laser for every 1 dB loss of the single photon signal. We observe improved non-classical correlations, and our data show that the addition of the frequency filter increases the non-classical correlations and readout efficiency of our quantum memory by ~35%.
Swanson, H Lee; Xinhua Zheng; Jerman, Olga
2009-01-01
The purpose of the present study was to synthesize research that compares children with and without reading disabilities (RD) on measures of short-term memory (STM) and working memory (WM). Across a broad age, reading, and IQ range, 578 effect sizes (ESs) were computed, yielding a mean ES across studies of -.89 (SD = 1.03). A total of 257 ESs were in the moderate range for STM measures (M = -.61, 95% confidence range of -.65 to -.58), and 320 ESs were in the moderate range for WM measures (M = -.67, 95% confidence range of -.68 to -.64). The results indicated that children with RD were distinctively disadvantaged compared with average readers on (a) STM measures requiring the recall of phonemes and digit sequences and (b) WM measures requiring the simultaneous processing and storage of digits within sentence sequences and final words from unrelated sentences. No significant moderating effects emerged for age, IQ, or reading level on memory ESs. The findings indicated that domain-specific STM and WM differences between ability groups persisted across age, suggesting that a verbal deficit model that fails to efficiently draw resources from both a phonological and executive system underlies RD.
Sparse distributed memory overview
NASA Technical Reports Server (NTRS)
Raugh, Mike
1990-01-01
The Sparse Distributed Memory (SDM) project is investigating the theory and applications of massively parallel computing architecture, called sparse distributed memory, that will support the storage and retrieval of sensory and motor patterns characteristic of autonomous systems. The immediate objectives of the project are centered in studies of the memory itself and in the use of the memory to solve problems in speech, vision, and robotics. Investigation of methods for encoding sensory data is an important part of the research. Examples of NASA missions that may benefit from this work are Space Station, planetary rovers, and solar exploration. Sparse distributed memory offers promising technology for systems that must learn through experience and be capable of adapting to new circumstances, and for operating any large complex system requiring automatic monitoring and control. Sparse distributed memory is a massively parallel architecture motivated by efforts to understand how the human brain works. Sparse distributed memory is an associative memory, able to retrieve information from cues that only partially match patterns stored in the memory. It is able to store long temporal sequences derived from the behavior of a complex system, such as progressive records of the system's sensory data and correlated records of the system's motor controls.
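As a concrete reference for the associative, partial-match retrieval described above, the following Python sketch implements a minimal Kanerva-style sparse distributed memory; the dimensions, number of hard locations and activation radius are illustrative assumptions, not values from the SDM project.

```python
import numpy as np

class SparseDistributedMemory:
    """Minimal Kanerva-style sparse distributed memory: a pattern is written
    into the counters of every 'hard location' whose fixed random address lies
    within a Hamming radius of the write address, and read back by summing and
    thresholding the counters of the locations activated by the read cue."""

    def __init__(self, n_bits=256, n_locations=2000, radius=120, seed=0):
        rng = np.random.default_rng(seed)
        self.addresses = rng.integers(0, 2, size=(n_locations, n_bits), dtype=np.int8)
        self.counters = np.zeros((n_locations, n_bits), dtype=np.int32)
        self.radius = radius

    def _activated(self, address):
        # Hard locations within the Hamming radius of the given address.
        return np.count_nonzero(self.addresses != address, axis=1) <= self.radius

    def write(self, address, data):
        act = self._activated(address)
        self.counters[act] += np.where(data == 1, 1, -1).astype(np.int32)

    def read(self, address):
        act = self._activated(address)
        return (self.counters[act].sum(axis=0) > 0).astype(np.int8)

rng = np.random.default_rng(1)
sdm = SparseDistributedMemory()
pattern = rng.integers(0, 2, 256, dtype=np.int8)
sdm.write(pattern, pattern)                 # autoassociative storage
cue = pattern.copy()
cue[:20] ^= 1                               # corrupt 20 of the 256 cue bits
recalled = sdm.read(cue)                    # retrieval from a partial match
print(int(np.count_nonzero(recalled != pattern)))   # 0 expected: pattern recovered
```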
Paying attention to attention in recognition memory: insights from models and electrophysiology.
Dubé, Chad; Payne, Lisa; Sekuler, Robert; Rotello, Caren M
2013-12-01
Reliance on remembered facts or events requires memory for their sources, that is, the contexts in which those facts or events were embedded. Understanding of source retrieval has been stymied by the fact that uncontrolled fluctuations of attention during encoding can cloud results of key importance to theoretical development. To address this issue, we combined electrophysiology (high-density electroencephalogram, EEG, recordings) with computational modeling of behavioral results. We manipulated subjects' attention to an auditory attribute, whether the source of individual study words was a male or female speaker. Posterior alpha-band (8-14 Hz) power in subjects' EEG increased after a cue to ignore the voice of the person who was about to speak. Receiver-operating-characteristic analysis validated our interpretation of oscillatory dynamics as a marker of attention to source information. With attention under experimental control, computational modeling showed unequivocally that memory for source (male or female speaker) reflected a continuous signal detection process rather than a threshold recollection process.
A mixed parallel strategy for the solution of coupled multi-scale problems at finite strains
NASA Astrophysics Data System (ADS)
Lopes, I. A. Rodrigues; Pires, F. M. Andrade; Reis, F. J. P.
2018-02-01
A mixed parallel strategy for the solution of homogenization-based multi-scale constitutive problems undergoing finite strains is proposed. The approach aims to reduce the computational time and memory requirements of non-linear coupled simulations that use finite element discretization at both scales (FE^2). In the first level of the algorithm, a non-conforming domain decomposition technique, based on the FETI method combined with a mortar discretization at the interface of macroscopic subdomains, is employed. A master-slave scheme, which distributes tasks by macroscopic element and adopts dynamic scheduling, is then used for each macroscopic subdomain composing the second level of the algorithm. This strategy allows the parallelization of FE^2 simulations in computers with either shared memory or distributed memory architectures. The proposed strategy preserves the quadratic rates of asymptotic convergence that characterize the Newton-Raphson scheme. Several examples are presented to demonstrate the robustness and efficiency of the proposed parallel strategy.
NASA Astrophysics Data System (ADS)
Lu, Bin; Cheng, Xiaomin; Feng, Jinlong; Guan, Xiawei; Miao, Xiangshui
2016-07-01
Nonvolatile memory devices or circuits that can implement both storage and calculation are a crucial requirement for improving the efficiency of modern computers. In this work, we realize logic functions by using a [GeTe/Sb2Te3]n super lattice phase change memory (PCM) cell, in which a higher threshold voltage is needed for phase change when a magnetic field is applied. First, the [GeTe/Sb2Te3]n super lattice cells were fabricated and the R-V curve was measured. Then we designed the logic circuits with the super lattice PCM cell, verified by HSPICE simulation and experiments. Seven basic logic functions are first demonstrated in this letter; then several multi-input logic gates are presented. The proposed logic devices offer the advantages of simple structures and low power consumption, indicating that the super lattice PCM has potential for future nonvolatile central processing unit design, facilitating the development of massively parallel computing architectures.
Code of Federal Regulations, 2012 CFR
2012-07-01
... of computer codes. The emission control diagnostic system shall record and store in computer memory..., shall be stored in computer memory to identify correctly functioning emission control systems and those... in computer memory. Should a subsequent fuel system or misfire malfunction occur, any previously...
Code of Federal Regulations, 2013 CFR
2013-07-01
... of computer codes. The emission control diagnostic system shall record and store in computer memory..., shall be stored in computer memory to identify correctly functioning emission control systems and those... in computer memory. Should a subsequent fuel system or misfire malfunction occur, any previously...
Visual Memories Bypass Normalization.
Bloem, Ilona M; Watanabe, Yurika L; Kibbe, Melissa M; Ling, Sam
2018-05-01
How distinct are visual memory representations from visual perception? Although evidence suggests that briefly remembered stimuli are represented within early visual cortices, the degree to which these memory traces resemble true visual representations remains something of a mystery. Here, we tested whether both visual memory and perception succumb to a seemingly ubiquitous neural computation: normalization. Observers were asked to remember the contrast of visual stimuli, which were pitted against each other to promote normalization either in perception or in visual memory. Our results revealed robust normalization between visual representations in perception, yet no signature of normalization occurring between working memory stores-neither between representations in memory nor between memory representations and visual inputs. These results provide unique insight into the nature of visual memory representations, illustrating that visual memory representations follow a different set of computational rules, bypassing normalization, a canonical visual computation.
Visual Memories Bypass Normalization
Bloem, Ilona M.; Watanabe, Yurika L.; Kibbe, Melissa M.; Ling, Sam
2018-01-01
How distinct are visual memory representations from visual perception? Although evidence suggests that briefly remembered stimuli are represented within early visual cortices, the degree to which these memory traces resemble true visual representations remains something of a mystery. Here, we tested whether both visual memory and perception succumb to a seemingly ubiquitous neural computation: normalization. Observers were asked to remember the contrast of visual stimuli, which were pitted against each other to promote normalization either in perception or in visual memory. Our results revealed robust normalization between visual representations in perception, yet no signature of normalization occurring between working memory stores—neither between representations in memory nor between memory representations and visual inputs. These results provide unique insight into the nature of visual memory representations, illustrating that visual memory representations follow a different set of computational rules, bypassing normalization, a canonical visual computation. PMID:29596038
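For readers unfamiliar with the normalization computation the study probes, a minimal sketch of canonical divisive normalization is given below; the exponent and semi-saturation constant are illustrative assumptions, not values from the study.

```python
# Illustrative divisive normalization (canonical form r_i = d_i^n / (sigma^n + sum_j d_j^n));
# parameter values are assumptions, not taken from the study.
import numpy as np

def normalize(drives, sigma=0.1, n=2.0):
    """Divide each unit's drive by the pooled activity of all units."""
    drives = np.asarray(drives, dtype=float)
    pooled = sigma**n + np.sum(drives**n)
    return drives**n / pooled

print(normalize([0.2, 0.8]))   # both responses are divided by the pooled activity
```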
NASA Technical Reports Server (NTRS)
Chan, Daniel C.; Darian, Armen; Sindir, Munir
1992-01-01
We have applied and compared the efficiency and accuracy of two commonly used numerical methods for the solution of the Navier-Stokes equations. The artificial compressibility method augments the continuity equation with a transient pressure term and allows one to solve the modified equations as a coupled system. Due to its implicit nature, one can afford a large temporal integration step at the expense of a higher memory requirement and a larger operation count per step. Meanwhile, the fractional step method splits the Navier-Stokes equations into a sequence of differential operators and integrates them in multiple steps. The memory requirement and operation count per time step are low; however, the restriction on the size of the time-marching step is more severe. To explore the strengths and weaknesses of these two methods, we used them to compute a two-dimensional driven cavity flow at Reynolds numbers of 100 and 1000. Three grid sizes, 41 x 41, 81 x 81, and 161 x 161, were used. The computations were considered converged after the L2-norm of the change in the dependent variables between two consecutive time steps had fallen below 10^-5.
Navier-Stokes computations useful in aircraft design
NASA Technical Reports Server (NTRS)
Holst, Terry L.
1990-01-01
Large scale Navier-Stokes computations about aircraft components as well as reasonably complete aircraft configurations are presented and discussed. Speed and memory requirements are described for various general problem classes, which in some cases are already being used in the industrial design environment. Recent computed results, with experimental comparisons when available, are included to highlight the presentation. Finally, prospects for the future are described and recommendations for areas of concentrated research are indicated. The future of Navier-Stokes computations is seen to be rapidly expanding across a broad front of applications, which includes the entire subsonic-to-hypersonic speed regime.
NASA Astrophysics Data System (ADS)
Strotov, Valery V.; Taganov, Alexander I.; Konkin, Yuriy V.; Kolesenkov, Aleksandr N.
2017-10-01
Processing and analyzing Earth remote sensing data on board an ultra-small spacecraft is a relevant task, given the significant energy cost of data transfer and the low performance of onboard computers. This raises the issue of effective and reliable storage of the overall information flow obtained from onboard data-collection systems, including Earth remote sensing data, in a specialized database. The paper considers the peculiarities of database management system operation with a multilevel memory structure. A format has been developed for storing data in the database that describes the database's physical structure and contains the parameters required for loading information. This structure reduces the memory occupied by the database because key values do not need to be stored separately. The paper presents the architecture of a relational database management system designed for embedding into the onboard software of an ultra-small spacecraft. Databases for storing various information, including Earth remote sensing data, can be built with such a database management system for subsequent processing. The suggested architecture places low demands on the computing power and memory resources available on board an ultra-small spacecraft. Data integrity is ensured during input and modification of the structured information.
Elfman, Kane W; Aly, Mariam; Yonelinas, Andrew P
2014-12-01
Recent evidence suggests that the hippocampus, a region critical for long-term memory, also supports certain forms of high-level visual perception. A seemingly paradoxical finding is that, unlike the thresholded hippocampal signals associated with memory, the hippocampus produces graded, strength-based signals in perception. This article tests a neurocomputational model of the hippocampus, based on the complementary learning systems framework, to determine if the same model can account for both memory and perception, and whether it produces the appropriate thresholded and strength-based signals in these two types of tasks. The simulations showed that the hippocampus, and most prominently the CA1 subfield, produced graded signals when required to discriminate between highly similar stimuli in a perception task, but generated thresholded patterns of activity in recognition memory. A threshold was observed in recognition memory because pattern completion occurred for only some trials and completely failed to occur for others; conversely, in perception, pattern completion always occurred because of the high degree of item similarity. These results offer a neurocomputational account of the distinct hippocampal signals associated with perception and memory, and are broadly consistent with proposals that CA1 functions as a comparator of expected versus perceived events. We conclude that the hippocampal computations required for high-level perceptual discrimination are congruous with current neurocomputational models that account for recognition memory, and fit neatly into a broader description of the role of the hippocampus for the processing of complex relational information. © 2014 Wiley Periodicals, Inc.
Execution time supports for adaptive scientific algorithms on distributed memory machines
NASA Technical Reports Server (NTRS)
Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey
1990-01-01
Optimizations are considered that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives the appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communication patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
Execution time support for scientific programs on distributed memory machines
NASA Technical Reports Server (NTRS)
Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey
1990-01-01
Optimizations are considered that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed memory machines. These primitives allow the user to control array mappings in a way that gives the appearance of shared memory. Computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communication patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
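As a schematic of the kind of gather operation such primitives automate (not the PARTI API itself), the sketch below translates global indices into (owner, local index) pairs for a block-distributed array and pulls the requested elements into a local buffer.

```python
# Schematic (not the PARTI API): gather of globally indexed elements from a
# block-distributed array, the kind of operation the primitives automate.
import numpy as np

def owner_and_local(global_idx, block_size):
    """Translate a global index into (owning process, local index)."""
    return global_idx // block_size, global_idx % block_size

def gather(distributed, indices, block_size):
    """Pull the requested global indices out of per-process blocks."""
    out = np.empty(len(indices))
    for k, g in enumerate(indices):
        p, i = owner_and_local(g, block_size)
        out[k] = distributed[p][i]          # stands in for a receive from process p
    return out

blocks = [np.arange(0, 4), np.arange(4, 8)]     # array 0..7 split over 2 "processes"
print(gather(blocks, [6, 1, 3], block_size=4))  # -> [6. 1. 3.]
```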
A chiral-based magnetic memory device without a permanent magnet
Dor, Oren Ben; Yochelis, Shira; Mathew, Shinto P.; Naaman, Ron; Paltiel, Yossi
2013-01-01
Several technologies are currently in use for computer memory devices. However, there is a need for a universal memory device that has high density, high speed and low power requirements. To this end, various types of magnetic-based technologies with a permanent magnet have been proposed. Recent charge-transfer studies indicate that chiral molecules act as an efficient spin filter. Here we utilize this effect to achieve a proof of concept for a new type of chiral-based magnetic-based Si-compatible universal memory device without a permanent magnet. More specifically, we use spin-selective charge transfer through a self-assembled monolayer of polyalanine to magnetize a Ni layer. This magnitude of magnetization corresponds to applying an external magnetic field of 0.4 T to the Ni layer. The readout is achieved using low currents. The presented technology has the potential to overcome the limitations of other magnetic-based memory technologies to allow fabricating inexpensive, high-density universal memory-on-chip devices. PMID:23922081
A chiral-based magnetic memory device without a permanent magnet.
Ben Dor, Oren; Yochelis, Shira; Mathew, Shinto P; Naaman, Ron; Paltiel, Yossi
2013-01-01
Several technologies are currently in use for computer memory devices. However, there is a need for a universal memory device that has high density, high speed and low power requirements. To this end, various types of magnetic-based technologies with a permanent magnet have been proposed. Recent charge-transfer studies indicate that chiral molecules act as an efficient spin filter. Here we utilize this effect to achieve a proof of concept for a new type of chiral-based magnetic-based Si-compatible universal memory device without a permanent magnet. More specifically, we use spin-selective charge transfer through a self-assembled monolayer of polyalanine to magnetize a Ni layer. This magnitude of magnetization corresponds to applying an external magnetic field of 0.4 T to the Ni layer. The readout is achieved using low currents. The presented technology has the potential to overcome the limitations of other magnetic-based memory technologies to allow fabricating inexpensive, high-density universal memory-on-chip devices.
Implementing Molecular Dynamics on Hybrid High Performance Computers - Three-Body Potentials
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, W Michael; Yamada, Masako
The use of coprocessors or accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, defined as machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more prevalent due to these advantages. Although there has been extensive research into methods to efficiently use accelerators to improve the performance of molecular dynamics (MD) employing pairwise potential energy models, little is reported in the literature for models that include many-body effects. 3-body terms are required for many popular potentials such as MEAM, Tersoff, REBO, AIREBO, Stillinger-Weber, Bond-Order Potentials, and others. Because the per-atom simulation times are much higher for models incorporating 3-body terms, there is a clear need for efficient algorithms usable on hybrid high performance computers. Here, we report a shared-memory force-decomposition for 3-body potentials that avoids memory conflicts to allow for a deterministic code with substantial performance improvements on hybrid machines. We describe modifications necessary for use in distributed memory MD codes and show results for the simulation of water with Stillinger-Weber on the hybrid Titan supercomputer. We compare performance of the 3-body model to the SPC/E water model when using accelerators. Finally, we demonstrate that our approach can attain a speedup of 5.1 with acceleration on Titan for production simulations to study water droplet freezing on a surface.
Topics in Semantic Representation
ERIC Educational Resources Information Center
Griffiths, Thomas L.; Steyvers, Mark; Tenenbaum, Joshua B.
2007-01-01
Processing language requires the retrieval of concepts from memory in response to an ongoing stream of information. This retrieval is facilitated if one can infer the gist of a sentence, conversation, or document and use that gist to predict related concepts and disambiguate words. This article analyzes the abstract computational problem…
1995-01-01
possible to determine communication points. For this version, a C program spawning Posix threads and using semaphores to synchronize would have to...performance such as the time required for network communication and synchronization as well as issues of asynchrony and memory hierarchy. For example...enhances reusability. Process (or task) parallel computations can also be succinctly expressed with a small set of process creation and synchronization
ERIC Educational Resources Information Center
Deryakulu, Deniz; Olkun, Sinan
2009-01-01
This study examined Turkish computer teachers' professional memories telling of their experiences with school administrators and supervisors. Seventy-four computer teachers participated in the study. Content analysis of the memories revealed that the most frequently mentioned themes concerning school administrators were "unsupportive…
Integrating Commercial Off-The-Shelf (COTS) graphics and extended memory packages with CLIPS
NASA Technical Reports Server (NTRS)
Callegari, Andres C.
1990-01-01
This paper addresses the question of how to combine CLIPS with graphics and how to overcome the PC's memory limitations by using the extended memory available in the computer. By adding graphics and extended memory capabilities, CLIPS can be converted into a complete and powerful system development tool on one of the most economical and popular computer platforms. New models of PCs have impressive processing capabilities and graphics resolutions that cannot be ignored and should be used to the fullest of their resources. CLIPS is a powerful expert system development tool, but it cannot be complete without the support of a graphics package for creating user interfaces and general-purpose graphics, or without enough memory to handle large knowledge bases. A well-known limitation of the PC is its real-memory usage, which restricts CLIPS to only 640 KB of real memory; that problem can now be solved by developing a version of CLIPS that uses extended memory. The user has access to up to 16 MB of memory on 80286-based computers and to practically all the available memory (4 GB) on computers that use the 80386 processor. If CLIPS is given a self-configuring graphics package that automatically detects the graphics hardware and pointing device present in the computer, together with access to the extended memory that exists in the computer (with no special hardware needed), users will be able to create more powerful systems at a fraction of the cost and on the most popular, portable, and economical platform available, the PC.
A Case Study on Neural Inspired Dynamic Memory Management Strategies for High Performance Computing.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vineyard, Craig Michael; Verzi, Stephen Joseph
As high performance computing architectures pursue more computational power there is a need for increased memory capacity and bandwidth as well. A multi-level memory (MLM) architecture addresses this need by combining multiple memory types with different characteristics as varying levels of the same architecture. How to efficiently utilize this memory infrastructure is an unknown challenge, and in this research we sought to investigate whether neural inspired approaches can meaningfully help with memory management. In particular we explored neurogenesis inspired resource allocation, and were able to show a neural inspired mixed controller policy can beneficially impact how MLM architectures utilize memory.
Computer hardware for radiologists: Part 2.
Indrajit, Ik; Alam, A
2010-11-01
Computers are an integral part of modern radiology equipment. In the first half of this two-part article, we dwelt upon some fundamental concepts regarding computer hardware, covering components like motherboard, central processing unit (CPU), chipset, random access memory (RAM), and memory modules. In this article, we describe the remaining computer hardware components that are of relevance to radiology. "Storage drive" is a term describing a "memory" hardware used to store data for later retrieval. Commonly used storage drives are hard drives, floppy drives, optical drives, flash drives, and network drives. The capacity of a hard drive is dependent on many factors, including the number of disk sides, number of tracks per side, number of sectors on each track, and the amount of data that can be stored in each sector. "Drive interfaces" connect hard drives and optical drives to a computer. The connections of such drives require both a power cable and a data cable. The four most popular "input/output devices" used commonly with computers are the printer, monitor, mouse, and keyboard. The "bus" is a built-in electronic signal pathway in the motherboard to permit efficient and uninterrupted data transfer. A motherboard can have several buses, including the system bus, the PCI express bus, the PCI bus, the AGP bus, and the (outdated) ISA bus. "Ports" are the location at which external devices are connected to a computer motherboard. All commonly used peripheral devices, such as printers, scanners, and portable drives, need ports. A working knowledge of computers is necessary for the radiologist if the workflow is to realize its full potential and, besides, this knowledge will prepare the radiologist for the coming innovations in the 'ever increasing' digital future.
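As a worked example of the geometry-based capacity formula mentioned above, the short calculation below multiplies sides, tracks, sectors, and bytes per sector; the figures are made up for illustration, not taken from any particular drive.

```python
# Illustrative capacity calculation for the geometry-based formula above;
# the figures are hypothetical.
sides = 4
tracks_per_side = 16_383
sectors_per_track = 63
bytes_per_sector = 512

capacity_bytes = sides * tracks_per_side * sectors_per_track * bytes_per_sector
print(f"{capacity_bytes / 1e9:.1f} GB")   # ~2.1 GB for this hypothetical geometry
```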
Efficient parallelization for AMR MHD multiphysics calculations; implementation in AstroBEAR
NASA Astrophysics Data System (ADS)
Carroll-Nellenback, Jonathan J.; Shroyer, Brandon; Frank, Adam; Ding, Chen
2013-03-01
Current adaptive mesh refinement (AMR) simulations require algorithms that are highly parallelized and manage memory efficiently. As compute engines grow larger, AMR simulations will require algorithms that achieve new levels of efficient parallelization and memory management. We have attempted to employ new techniques to achieve both of these goals. Patch or grid based AMR often employs ghost cells to decouple the hyperbolic advances of each grid on a given refinement level. This decoupling allows each grid to be advanced independently. In AstroBEAR we utilize this independence by threading the grid advances on each level with preference going to the finer level grids. This allows for global load balancing instead of level by level load balancing and allows for greater parallelization across both physical space and AMR level. Threading of level advances can also improve performance by interleaving communication with computation, especially in deep simulations with many levels of refinement. While we see improvements of up to 30% on deep simulations run on a few cores, the speedup is typically more modest (5-20%) for larger scale simulations. To improve memory management we have employed a distributed tree algorithm that requires processors to only store and communicate local sections of the AMR tree structure with neighboring processors. Using this distributed approach we are able to get reasonable scaling efficiency (>80%) out to 12288 cores and up to 8 levels of AMR - independent of the use of threading.
Stability of discrete memory states to stochastic fluctuations in neuronal systems
Miller, Paul; Wang, Xiao-Jing
2014-01-01
Noise can degrade memories by causing transitions from one memory state to another. For any biological memory system to be useful, the time scale of such noise-induced transitions must be much longer than the required duration for memory retention. Using biophysically-realistic modeling, we consider two types of memory in the brain: short-term memories maintained by reverberating neuronal activity for a few seconds, and long-term memories maintained by a molecular switch for years. Both systems require persistence of (neuronal or molecular) activity self-sustained by an autocatalytic process and, we argue, that both have limited memory lifetimes because of significant fluctuations. We will first discuss a strongly recurrent cortical network model endowed with feedback loops, for short-term memory. Fluctuations are due to highly irregular spike firing, a salient characteristic of cortical neurons. Then, we will analyze a model for long-term memory, based on an autophosphorylation mechanism of calcium/calmodulin-dependent protein kinase II (CaMKII) molecules. There, fluctuations arise from the fact that there are only a small number of CaMKII molecules at each postsynaptic density (putative synaptic memory unit). Our results are twofold. First, we demonstrate analytically and computationally the exponential dependence of stability on the number of neurons in a self-excitatory network, and on the number of CaMKII proteins in a molecular switch. Second, for each of the two systems, we implement graded memory consisting of a group of bistable switches. For the neuronal network we report interesting ramping temporal dynamics as a result of sequentially switching an increasing number of discrete, bistable, units. The general observation of an exponential increase in memory stability with the system size leads to a trade-off between the robustness of memories (which increases with the size of each bistable unit) and the total amount of information storage (which decreases with increasing unit size), which may be optimized in the brain through biological evolution. PMID:16822041
Optical interconnection networks for high-performance computing systems
NASA Astrophysics Data System (ADS)
Biberman, Aleksandr; Bergman, Keren
2012-04-01
Enabled by silicon photonic technology, optical interconnection networks have the potential to be a key disruptive technology in computing and communication industries. The enduring pursuit of performance gains in computing, combined with stringent power constraints, has fostered the ever-growing computational parallelism associated with chip multiprocessors, memory systems, high-performance computing systems and data centers. Sustaining these parallelism growths introduces unique challenges for on- and off-chip communications, shifting the focus toward novel and fundamentally different communication approaches. Chip-scale photonic interconnection networks, enabled by high-performance silicon photonic devices, offer unprecedented bandwidth scalability with reduced power consumption. We demonstrate that the silicon photonic platforms have already produced all the high-performance photonic devices required to realize these types of networks. Through extensive empirical characterization in much of our work, we demonstrate such feasibility of waveguides, modulators, switches and photodetectors. We also demonstrate systems that simultaneously combine many functionalities to achieve more complex building blocks. We propose novel silicon photonic devices, subsystems, network topologies and architectures to enable unprecedented performance of these photonic interconnection networks. Furthermore, the advantages of photonic interconnection networks extend far beyond the chip, offering advanced communication environments for memory systems, high-performance computing systems, and data centers.
Acceleration of FDTD mode solver by high-performance computing techniques.
Han, Lin; Xi, Yanping; Huang, Wei-Ping
2010-06-21
A two-dimensional (2D) compact finite-difference time-domain (FDTD) mode solver is developed based on wave equation formalism in combination with the matrix pencil method (MPM). The method is validated for calculation of both real guided and complex leaky modes of typical optical waveguides against the benchmark finite-difference (FD) eigen mode solver. By taking advantage of the inherent parallel nature of the FDTD algorithm, the mode solver is implemented on graphics processing units (GPUs) using the compute unified device architecture (CUDA). It is demonstrated that the high-performance computing technique leads to significant acceleration of the FDTD mode solver, with more than 30 times improvement in computational efficiency in comparison with the conventional FDTD mode solver running on the CPU of a standard desktop computer. The computational efficiency of the accelerated FDTD method is of the same order of magnitude as that of the standard finite-difference eigen mode solver, yet it requires much less memory (e.g., less than 10%). Therefore, the new method may serve as an efficient, accurate and robust tool for mode calculation of optical waveguides even when the conventional eigen value mode solvers are no longer applicable due to memory limitation.
DMA shared byte counters in a parallel computer
Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos
2010-04-06
A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFO metadata maintains the memory locations of the injection FIFOs, including the current head and tail of each, and the reception FIFO metadata maintains the memory locations of the reception FIFOs, including the current head and tail of each. The injection byte counters and reception byte counters may be shared between messages.
NASA Technical Reports Server (NTRS)
1996-01-01
Through Goddard Space Flight Center and Jet Propulsion Laboratory Small Business Innovation Research contracts, Irvine Sensors developed a three-dimensional memory system for a spaceborne data recorder and other applications for NASA. From these contracts, the company created the Memory Short Stack product, a patented technology for stacking integrated circuits that offers higher processing speeds and levels of integration, and lower power requirements. The product is a three-dimensional semiconductor package in which dozens of integrated circuits are stacked upon each other to form a cube. The technology is being used in various computer and telecommunications applications.
An iterative approach to region growing using associative memories
NASA Technical Reports Server (NTRS)
Snyder, W. E.; Cowart, A.
1983-01-01
Region growing is often given as a classical example of the recursive control structures used in image processing, which are awkward to implement in hardware when the intent is to segment an image at raster-scan rates. It is addressed here in light of the postulate that any computation which can be performed recursively can be performed easily and efficiently by iteration coupled with association. Attention is given to an algorithm and hardware structure able to perform region labeling iteratively at scan rates. Every pixel is individually labeled with an identifier that signifies the region to which it belongs. Difficulties that would otherwise require recursion are handled by maintaining an equivalence table in hardware, transparent to the computer that reads the labeled pixels. A simulation of the associative memory has demonstrated its effectiveness.
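A minimal software sketch of the iterative idea follows: a single raster scan assigns provisional labels and records equivalences between touching labels, which a second pass resolves. This stands in for the paper's hardware equivalence table and is purely illustrative.

```python
# Sketch of raster-scan region labeling with an equivalence (union-find) table,
# illustrating the iterative alternative to recursion described above.
import numpy as np

def label_regions(img):
    labels = np.zeros_like(img, dtype=int)
    parent = {}                                   # equivalence table

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    next_label = 1
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            if not img[r, c]:
                continue
            up = labels[r - 1, c] if r > 0 else 0
            left = labels[r, c - 1] if c > 0 else 0
            neighbors = [l for l in (up, left) if l]
            if not neighbors:
                labels[r, c] = next_label         # start a new region
                parent[next_label] = next_label
                next_label += 1
            else:
                lab = min(neighbors)
                labels[r, c] = lab
                for other in neighbors:           # record equivalences
                    parent[find(other)] = find(lab)
    # second pass: resolve equivalences
    for r in range(rows):
        for c in range(cols):
            if labels[r, c]:
                labels[r, c] = find(labels[r, c])
    return labels

img = np.array([[1, 1, 0, 1],
                [0, 1, 0, 1],
                [1, 1, 0, 0]])
print(label_regions(img))
```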
A high performance parallel algorithm for 1-D FFT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Agarwal, R.C.; Gustavson, F.G.; Zubair, M.
1994-12-31
In this paper the authors propose a parallel high performance FFT algorithm based on a multi-dimensional formulation. They use this to solve a commonly encountered FFT based kernel on a distributed memory parallel machine, the IBM scalable parallel system, SP1. The kernel requires a forward FFT computation of an input sequence, multiplication of the transformed data by a coefficient array, and finally an inverse FFT computation of the resultant data. They show that the multi-dimensional formulation helps in reducing the communication costs and also improves the single node performance by effectively utilizing the memory system of the node. They implemented this kernel on the IBM SP1 and observed a performance of 1.25 GFLOPS on a 64-node machine.
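The benchmarked kernel (forward FFT, pointwise multiplication by a coefficient array, inverse FFT) can be sketched serially with NumPy as below; this shows only the data flow, not the multi-dimensional parallel formulation or the SP1 implementation.

```python
# Serial sketch of the benchmarked kernel: forward FFT, pointwise multiply by a
# coefficient array, then inverse FFT. Sizes and coefficients are illustrative.
import numpy as np

n = 1 << 10
x = np.random.rand(n)                              # input sequence
coeff = np.random.rand(n) + 1j * np.random.rand(n)

y = np.fft.ifft(np.fft.fft(x) * coeff)             # the whole kernel in one line
print(y.shape, y.dtype)
```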
Simplifying and speeding the management of intra-node cache coherence
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Phillip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2012-04-17
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A; Chen, Dong; Coteus, Paul W
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Perumalla, Kalyan S.; Yoginath, Srikanth B.
Problems such as fault tolerance and scalable synchronization can be efficiently solved using reversibility of applications. Making applications reversible by relying on computation rather than on memory is ideal for large scale parallel computing, especially for the next generation of supercomputers in which memory is expensive in terms of latency, energy, and price. In this direction, a case study is presented here in reversing a computational core, namely, Basic Linear Algebra Subprograms, which is widely used in scientific applications. A new Reversible BLAS (RBLAS) library interface has been designed, and a prototype has been implemented with two modes: (1) a memory-mode in which reversibility is obtained by checkpointing to memory in forward and restoring from memory in reverse, and (2) a computational-mode in which nothing is saved in the forward, but restoration is done entirely via inverse computation in reverse. The article is focused on detailed performance benchmarking to evaluate the runtime dynamics and performance effects, comparing reversible computation with checkpointing on both traditional CPU platforms and recent GPU accelerator platforms. For BLAS Level-1 subprograms, data indicates over an order of magnitude better speed of reversible computation compared to checkpointing. For BLAS Level-2 and Level-3, a more complex tradeoff is observed between reversible computation and checkpointing, depending on computational and memory complexities of the subprograms.
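The two modes can be illustrated for a Level-1 operation such as scal (x <- a*x): the memory mode checkpoints x before the forward sweep, while the computational mode restores it by applying the inverse operation. The helper names below are hypothetical, not the RBLAS interface.

```python
# Illustrative contrast of the two reversibility modes for a BLAS-1 scal
# (x <- a*x). These helper names are hypothetical, not the RBLAS interface.
import numpy as np

def scal_forward_memory_mode(a, x, checkpoint):
    checkpoint[:] = x          # save state before overwriting (memory cost: len(x))
    x *= a

def scal_reverse_memory_mode(x, checkpoint):
    x[:] = checkpoint          # restore from checkpoint

def scal_forward_compute_mode(a, x):
    x *= a                     # nothing saved

def scal_reverse_compute_mode(a, x):
    x /= a                     # restore purely by inverse computation (needs a != 0)

x = np.arange(4, dtype=float)
ckpt = np.empty_like(x)
scal_forward_memory_mode(2.0, x, ckpt); scal_reverse_memory_mode(x, ckpt)
scal_forward_compute_mode(2.0, x);      scal_reverse_compute_mode(2.0, x)
print(x)   # back to [0. 1. 2. 3.] under both modes
```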
Márquez Neila, Pablo; Baumela, Luis; González-Soriano, Juncal; Rodríguez, Jose-Rodrigo; DeFelipe, Javier; Merchán-Pérez, Ángel
2016-04-01
Recent electron microscopy (EM) imaging techniques permit the automatic acquisition of a large number of serial sections from brain samples. Manual segmentation of these images is tedious, time-consuming and requires a high degree of user expertise. Therefore, there is considerable interest in developing automatic segmentation methods. However, currently available methods are computationally demanding in terms of computer time and memory usage, and to work properly many of them require image stacks to be isotropic, that is, voxels must have the same size in the X, Y and Z axes. We present a method that works with anisotropic voxels and that is computationally efficient allowing the segmentation of large image stacks. Our approach involves anisotropy-aware regularization via conditional random field inference and surface smoothing techniques to improve the segmentation and visualization. We have focused on the segmentation of mitochondria and synaptic junctions in EM stacks from the cerebral cortex, and have compared the results to those obtained by other methods. Our method is faster than other methods with similar segmentation results. Our image regularization procedure introduces high-level knowledge about the structure of labels. We have also reduced memory requirements with the introduction of energy optimization in overlapping partitions, which permits the regularization of very large image stacks. Finally, the surface smoothing step improves the appearance of three-dimensional renderings of the segmented volumes.
NASA Astrophysics Data System (ADS)
Cantrell, Jason T.
This document outlines in detail the research performed by applying shape memory polymers in a generic unimorph actuator configuration. A set of experiments designed to investigate the influence of transverse curvature, the relative widths of shape memory polymer and composite substrates, and shape memory polymer thickness on actuator recoverability after multiple thermo-mechanical cycles is presented in detail. A theoretical model of the moment required to maintain shape fixity with minimal shape retention loss was developed and experimentally validated for unimorph composite actuators of varying cross-sectional areas. Theoretical models were also developed and evaluated to determine the relationship between the materials neutral axes and thermal stability during a thermo-mechanical cycle. Research was conducted on the incorporation of shape memory polymers on micro air vehicle wings to maximize shape fixity and shape recoverability while minimizing the volume of shape memory polymer on the wing surface. Applications based research also included experimentally evaluating the feasibility of shape memory polymers on deployable satellite antenna ribs both with and without resistance heaters which could be utilized to assist in antenna deployment.
Reducing the cost of using collocation to compute vibrational energy levels: Results for CH2NH.
Avila, Gustavo; Carrington, Tucker
2017-08-14
In this paper, we improve the collocation method for computing vibrational spectra that was presented in the work of Avila and Carrington, Jr. [J. Chem. Phys. 143, 214108 (2015)]. Known quadrature and collocation methods using a Smolyak grid require storing intermediate vectors with more elements than points on the Smolyak grid. This is due to the fact that grid labels are constrained among themselves and basis labels are constrained among themselves. We show that by using the so-called hierarchical basis functions, one can significantly reduce the memory required. In this paper, the intermediate vectors have only as many elements as the Smolyak grid. The ideas are tested by computing energy levels of CH 2 NH.
NASA Technical Reports Server (NTRS)
Zhang, Jun; Ge, Lixin; Kouatchou, Jules
2000-01-01
A new fourth order compact difference scheme for the three dimensional convection diffusion equation with variable coefficients is presented. The novelty of this new difference scheme is that it only requires 15 grid points and that it can be decoupled with two colors. The entire computational grid can be updated in two parallel subsweeps with the Gauss-Seidel type iterative method. This is compared with the known 19 point fourth order compact difference scheme which requires four colors to decouple the computational grid. Numerical results, with multigrid methods implemented on a shared memory parallel computer, are presented to compare the 15 point and the 19 point fourth order compact schemes.
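The two-color decoupling can be illustrated with a red-black Gauss-Seidel sweep for the standard second-order 2-D Laplacian, as sketched below; the paper's 15-point fourth-order 3-D compact stencil is not reproduced here.

```python
# Red-black (two-color) Gauss-Seidel sweep for a 2-D Poisson problem
# (u_xx + u_yy = f). This only illustrates the coloring idea; the paper's
# 15-point fourth-order 3-D compact stencil is not reproduced here.
import numpy as np

def red_black_sweep(u, f, h):
    n, m = u.shape
    for color in (0, 1):                         # all points of one color update together
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                if (i + j) % 2 == color:
                    u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                      + u[i, j - 1] + u[i, j + 1]
                                      - h * h * f[i, j])
    return u

n = 17
u = np.zeros((n, n)); f = np.ones((n, n)); h = 1.0 / (n - 1)
for _ in range(50):
    red_black_sweep(u, f, h)
print(u[n // 2, n // 2])    # approximate center value of the Poisson solution
```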
High performance flight computer developed for deep space applications
NASA Technical Reports Server (NTRS)
Bunker, Robert L.
1993-01-01
The development of an advanced space flight computer for real time embedded deep space applications which embodies the lessons learned on Galileo and modern computer technology is described. The requirements are listed and the design implementation that meets those requirements is described. The development of SPACE-16 (Spaceborne Advanced Computing Engine), where 16 designates the databus width, was initiated to support the MM2 (Mariner Mark 2) project. The computer is based on a radiation hardened emulation of a modern 32 bit microprocessor and its family of support devices, including a high performance floating point accelerator. Additional custom devices, which include a coprocessor to improve input/output capabilities, a memory interface chip, and an additional support chip that provides management of all fault tolerant features, are described. Detailed supporting analyses and rationale which justify specific design and architectural decisions are provided. The six chip types were designed and fabricated. Testing and evaluation of a brassboard was initiated.
III. NIH Toolbox Cognition Battery (CB): measuring episodic memory.
Bauer, Patricia J; Dikmen, Sureyya S; Heaton, Robert K; Mungas, Dan; Slotkin, Jerry; Beaumont, Jennifer L
2013-08-01
One of the most significant domains of cognition is episodic memory, which allows for rapid acquisition and long-term storage of new information. For purposes of the NIH Toolbox, we devised a new test of episodic memory. The nonverbal NIH Toolbox Picture Sequence Memory Test (TPSMT) requires participants to reproduce the order of an arbitrarily ordered sequence of pictures presented on a computer. To adjust for ability, sequence length varies from 6 to 15 pictures. Multiple trials are administered to increase reliability. Pediatric data from the validation study revealed the TPSMT to be sensitive to age-related changes. The task also has high test-retest reliability and promising construct validity. Steps to further increase the sensitivity of the instrument to individual and age-related variability are described. © 2013 The Society for Research in Child Development, Inc.
Silicon photonic integrated circuits with electrically programmable non-volatile memory functions.
Song, J-F; Lim, A E-J; Luo, X-S; Fang, Q; Li, C; Jia, L X; Tu, X-G; Huang, Y; Zhou, H-F; Liow, T-Y; Lo, G-Q
2016-09-19
Conventional silicon photonic integrated circuits do not normally possess memory functions, which require on-chip power in order to maintain circuit states in tuned or field-configured switching routes. In this context, we present an electrically programmable add/drop microring resonator with a wavelength shift of 426 pm between the ON/OFF states. Electrical pulses are used to control the choice of the state. Our experimental results show a wavelength shift of 2.8 pm/ms and a light intensity variation of ~0.12 dB/ms for a fixed wavelength in the OFF state. Theoretically, our device can accommodate up to 65 states of multi-level memory functions. Such memory functions can be integrated into wavelength division multiplexing (WDM) filters and applied to optical routers and computing architectures fulfilling large data downloading demands.
Okuhata, Shiho; Kusanagi, Takuya; Kobayashi, Tetsuo
2013-10-25
The present study investigated EEG alpha activity during visual Sternberg memory tasks using two different stimulus presentation modes to elucidate how the presentation mode affected parietal alpha activity. EEGs were recorded from 10 healthy adults during Sternberg tasks in which memory items were presented either simultaneously or successively. EEG power and suppression time (ST) in the alpha band (8-13 Hz) were computed for the memory maintenance and retrieval phases. The alpha activity differed according to the presentation mode during the maintenance phase but not during the retrieval phase. Results indicated that parietal alpha power recorded during the maintenance phase did not reflect the memory load alone. In contrast, ST during the retrieval phase increased with the memory load for both presentation modes, indicating a serial memory scanning process regardless of the presentation mode. These results indicate that there was a dynamic transition in the memory process from the maintenance phase, which was sensitive to external factors, toward the retrieval phase, during which the process converged on the sequential scanning process that the Sternberg task essentially required. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Goldberg, Daniel N.; Narayanan, Sri Hari Krishna; Hascoet, Laurent; ...
2016-05-20
We apply an optimized method to the adjoint generation of a time-evolving land ice model through algorithmic differentiation (AD). The optimization involves a special treatment of the fixed-point iteration required to solve the nonlinear stress balance, which differs from a straightforward application of AD software, and leads to smaller memory requirements and in some cases shorter computation times of the adjoint. The optimization is done via implementation of the algorithm of Christianson (1994) for reverse accumulation of fixed-point problems, with the AD tool OpenAD. For test problems, the optimized adjoint is shown to have far lower memory requirements, potentially enabling larger problem sizes on memory-limited machines. In the case of the land ice model, implementation of the algorithm allows further optimization by having the adjoint model solve a sequence of linear systems with identical (as opposed to varying) matrices, greatly improving performance. Finally, the methods introduced here will be of value to other efforts applying AD tools to ice models, particularly ones which solve a hybrid shallow ice/shallow shelf approximation to the Stokes equations.
Distinct roles of dopamine and subthalamic nucleus in learning and probabilistic decision making.
Coulthard, Elizabeth J; Bogacz, Rafal; Javed, Shazia; Mooney, Lucy K; Murphy, Gillian; Keeley, Sophie; Whone, Alan L
2012-12-01
Even simple behaviour requires us to make decisions based on combining multiple pieces of learned and new information. Making such decisions requires both learning the optimal response to each given stimulus as well as combining probabilistic information from multiple stimuli before selecting a response. Computational theories of decision making predict that learning individual stimulus-response associations and rapid combination of information from multiple stimuli are dependent on different components of basal ganglia circuitry. In particular, learning and retention of memory, required for optimal response choice, are significantly reliant on dopamine, whereas integrating information probabilistically is critically dependent upon functioning of the glutamatergic subthalamic nucleus (computing the 'normalization term' in Bayes' theorem). Here, we test these theories by investigating 22 patients with Parkinson's disease either treated with deep brain stimulation to the subthalamic nucleus and dopaminergic therapy or managed with dopaminergic therapy alone. We use computerized tasks that probe three cognitive functions-information acquisition (learning), memory over a delay and information integration when multiple pieces of sequentially presented information have to be combined. Patients performed the tasks ON or OFF deep brain stimulation and/or ON or OFF dopaminergic therapy. Consistent with the computational theories, we show that stopping dopaminergic therapy impairs memory for probabilistic information over a delay, whereas deep brain stimulation to the region of the subthalamic nucleus disrupts decision making when multiple pieces of acquired information must be combined. Furthermore, we found that when participants needed to update their decision on the basis of the last piece of information presented in the decision-making task, patients with deep brain stimulation of the subthalamic nucleus region did not slow down appropriately to revise their plan, a pattern of behaviour that mirrors the impulsivity described clinically in some patients with subthalamic nucleus deep brain stimulation. Thus, we demonstrate distinct mechanisms for two important facets of human decision making: first, a role for dopamine in memory consolidation, and second, the critical importance of the subthalamic nucleus in successful decision making when multiple pieces of information must be combined.
Multidimensional NMR inversion without Kronecker products: Multilinear inversion
NASA Astrophysics Data System (ADS)
Medellín, David; Ravi, Vivek R.; Torres-Verdín, Carlos
2016-08-01
Multidimensional NMR inversion using Kronecker products poses several challenges. First, kernel compression is only possible when the kernel matrices are separable, and in recent years, there has been an increasing interest in NMR sequences with non-separable kernels. Second, in three or more dimensions, the singular value decomposition is not unique; therefore kernel compression is not well-defined for higher dimensions. Without kernel compression, the Kronecker product yields matrices that require large amounts of memory, making the inversion intractable for personal computers. Finally, incorporating arbitrary regularization terms is not possible using the Lawson-Hanson (LH) or the Butler-Reeds-Dawson (BRD) algorithms. We develop a minimization-based inversion method that circumvents the above problems by using multilinear forms to perform multidimensional NMR inversion without using kernel compression or Kronecker products. The new method is memory efficient, requiring less than 0.1% of the memory required by the LH or BRD methods. It can also be extended to arbitrary dimensions and adapted to include non-separable kernels, linear constraints, and arbitrary regularization terms. Additionally, it is easy to implement because only a cost function and its first derivative are required to perform the inversion.
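The central memory saving can be sketched for a separable two-dimensional case: instead of materializing the Kronecker product, each kernel is applied along its own axis. The sizes below are illustrative, and the sketch does not cover the non-separable or regularized cases the paper addresses.

```python
# Sketch of applying a separable 2-D forward model without forming the
# Kronecker product; dimensions are illustrative.
import numpy as np

m1, n1, m2, n2 = 50, 100, 60, 120
K1 = np.random.rand(m1, n1)       # kernel along dimension 1
K2 = np.random.rand(m2, n2)       # kernel along dimension 2
F = np.random.rand(n1, n2)        # model (e.g., a 2-D relaxation distribution)

# The Kronecker route would materialize an (m1*m2) x (n1*n2) matrix:
kron_entries = (m1 * m2) * (n1 * n2)          # 36,000,000 entries here

# Multilinear route: apply each kernel along its own axis instead.
data = K1 @ F @ K2.T                          # shape (m1, m2), no Kronecker product
print(data.shape, "computed while avoiding", kron_entries, "Kronecker entries")
```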
Managing internode data communications for an uninitialized process in a parallel computer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R
2014-05-20
A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.
Managing internode data communications for an uninitialized process in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E
2014-05-20
A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.
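A schematic of the buffering policy described in this entry (illustrative only, with made-up names and capacities): messages for a not-yet-initialized process accumulate in a fixed-size MU buffer and are moved to a temporary main-memory buffer when that buffer fills.

```python
# Schematic of the spill policy described above; capacities and names are
# hypothetical, not taken from the patent.
from collections import deque

MU_BUFFER_CAPACITY = 4

mu_buffer = deque()          # fixed-capacity messaging-unit buffer
temp_buffer = []             # temporary buffer in main memory

def deliver(message, process_initialized=False):
    if process_initialized:
        return "handled directly"
    mu_buffer.append(message)
    if len(mu_buffer) > MU_BUFFER_CAPACITY:          # buffer full before init
        while mu_buffer:
            temp_buffer.append(mu_buffer.popleft())  # move messages to main memory
    return "queued"

for k in range(6):
    deliver(f"msg-{k}")
print(len(mu_buffer), len(temp_buffer))   # -> 1 5
```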
Piquado, Tepring; Cousins, Katheryn A Q; Wingfield, Arthur; Miller, Paul
2010-12-13
Poor hearing acuity reduces memory for spoken words, even when the words are presented with enough clarity for correct recognition. An "effortful hypothesis" suggests that the perceptual effort needed for recognition draws from resources that would otherwise be available for encoding the word in memory. To assess this hypothesis, we conducted a behavioral task requiring immediate free recall of word-lists, some of which contained an acoustically masked word that was just above perceptual threshold. Results show that masking a word reduces the recall of that word and words prior to it, as well as weakening the linking associations between the masked and prior words. In contrast, recall probabilities of words following the masked word are not affected. To account for this effect we conducted computational simulations testing two classes of models: Associative Linking Models and Short-Term Memory Buffer Models. Only a model that integrated both contextual linking and buffer components matched all of the effects of masking observed in our behavioral data. In this Linking-Buffer Model, the masked word disrupts a short-term memory buffer, causing associative links of words in the buffer to be weakened, affecting memory for the masked word and the word prior to it, while allowing links of words following the masked word to be spared. We suggest that these data account for the so-called "effortful hypothesis", where distorted input has a detrimental impact on prior information stored in short-term memory. Copyright © 2010 Elsevier B.V. All rights reserved.
Kiefer, Gundolf; Lehmann, Helko; Weese, Jürgen
2006-04-01
Maximum intensity projections (MIPs) are an important visualization technique for angiographic data sets. Efficient data inspection requires frame rates of at least five frames per second at preserved image quality. Despite the advances in computer technology, this task remains a challenge. On the one hand, the sizes of computed tomography and magnetic resonance images are increasing rapidly. On the other hand, rendering algorithms do not automatically benefit from the advances in processor technology, especially for large data sets. This is due to the faster evolving processing power and the slower evolving memory access speed, which is bridged by hierarchical cache memory architectures. In this paper, we investigate memory access optimization methods and use them for generating MIPs on general-purpose central processing units (CPUs) and graphics processing units (GPUs), respectively. These methods can work on any level of the memory hierarchy, and we show that properly combined methods can optimize memory access on multiple levels of the hierarchy at the same time. We present performance measurements to compare different algorithm variants and illustrate the influence of the respective techniques. On current hardware, the efficient handling of the memory hierarchy for CPUs improves the rendering performance by a factor of 3 to 4. On GPUs, we observed that the effect is even larger, especially for large data sets. The methods can easily be adjusted to different hardware specifics, although their impact can vary considerably. They can also be used for other rendering techniques than MIPs, and their use for more general image processing task could be investigated in the future.
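At its core a MIP is a maximum over the viewing axis; the sketch below shows that operation for the axis-aligned case only, and none of the cache-blocking or memory-hierarchy optimizations studied in the paper.

```python
# A MIP is a maximum over the viewing axis; axis-aligned case only, without the
# memory-access optimizations the paper investigates. The volume is synthetic.
import numpy as np

volume = np.random.rand(64, 128, 128)      # illustrative volume laid out as (z, y, x)
mip = volume.max(axis=0)                   # project along z
print(mip.shape)                           # (128, 128)
```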
Representation-Independent Iteration of Sparse Data Arrays
NASA Technical Reports Server (NTRS)
James, Mark
2007-01-01
An approach is defined that describes a method of iterating over massively large arrays containing sparse data in a way that is independent of how the contents of the sparse arrays are laid out in memory. What is unique and important here is the decoupling of the iteration over the sparse set of array elements from how they are internally represented in memory. This makes the approach backward compatible with existing schemes for representing sparse arrays as well as with new ones. A functional interface is defined for implementing sparse arrays in any modern programming language, with a particular focus on the Chapel programming language. Examples are provided that show the translation of a loop that computes a matrix-vector product into this representation for both the distributed and non-distributed cases. This work is directly applicable to NASA and to the High Productivity Computing Systems (HPCS) program in which JPL and our current program are engaged. The goal of that program is to create powerful, scalable, and economically viable high-powered computer systems suitable for use in national security and industry by 2010. This is important to NASA because of its computationally intensive requirements for analyzing and understanding the volumes of science data from returned missions.
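A minimal Python analogue of the idea follows: the matrix-vector product is written only against a small iteration interface, so it works unchanged for a dict-of-keys and a coordinate-list representation. The interface shown is hypothetical, not the Chapel one defined in the paper.

```python
# Hypothetical functional interface for iterating a sparse array independently
# of its storage layout, used here for a matrix-vector product.
def nonzeros_dict(a):
    """Yield (row, col, value) from a dict-of-keys representation."""
    for (i, j), v in a.items():
        yield i, j, v

def nonzeros_coo(rows, cols, vals):
    """Yield (row, col, value) from coordinate-list storage."""
    yield from zip(rows, cols, vals)

def spmv(nonzeros, x, nrows):
    """Matrix-vector product written only against the iteration interface."""
    y = [0.0] * nrows
    for i, j, v in nonzeros:
        y[i] += v * x[j]
    return y

x = [1.0, 2.0, 3.0]
dok = {(0, 0): 2.0, (1, 2): -1.0}
print(spmv(nonzeros_dict(dok), x, nrows=2))                     # [2.0, -3.0]
print(spmv(nonzeros_coo([0, 1], [0, 2], [2.0, -1.0]), x, 2))    # same result
```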
Parametric State Space Structuring
NASA Technical Reports Server (NTRS)
Ciardo, Gianfranco; Tilgner, Marco
1997-01-01
Structured approaches based on Kronecker operators for the description and solution of the infinitesimal generator of a continuous-time Markov chains are receiving increasing interest. However, their main advantage, a substantial reduction in the memory requirements during the numerical solution, comes at a price. Methods based on the "potential state space" allocate a probability vector that might be much larger than actually needed. Methods based on the "actual state space", instead, have an additional logarithmic overhead. We present an approach that realizes the advantages of both methods with none of their disadvantages, by partitioning the local state spaces of each submodel. We apply our results to a model of software rendezvous, and show how they reduce memory requirements while, at the same time, improving the efficiency of the computation.
A Succinct Naming Convention for Lengthy Hexadecimal Numbers
NASA Technical Reports Server (NTRS)
Grant, Michael S.
1997-01-01
Engineers, computer scientists, mathematicians and others must often deal with lengthy hexadecimal numbers. As memory requirements for software increase, the associated memory address space for systems necessitates the use of longer and longer strings of hexadecimal characters to describe a given number. For example, the address space of some digital signal processors (DSP's) now ranges in the billions of words, requiring eight hexadecimal characters for many of the addresses. This technical memorandum proposes a simple grouping scheme for more clearly representing lengthy hexadecimal numbers in written material, as well as a "code" for naming and more quickly verbalizing such numbers. This should facilitate communications among colleagues in engineering and related fields, and aid in comprehension and temporary memorization of important hexadecimal numbers during design work.
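The memorandum's specific grouping and naming scheme is not reproduced here, but the helper below shows the general idea of breaking a long hexadecimal address into readable groups; the group size of four and the underscore separator are assumptions.

```python
# Illustrative grouping of a long hexadecimal address into 4-digit chunks for
# readability; group size and separator are assumptions, not the memo's scheme.
def group_hex(value, width=8, group=4, sep="_"):
    digits = f"{value:0{width}X}"
    chunks = [digits[max(0, i - group):i] for i in range(len(digits), 0, -group)]
    return sep.join(reversed(chunks))

print(group_hex(0x3FFE01C0))   # -> 3FFE_01C0
```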
Parallel Calculations in LS-DYNA
NASA Astrophysics Data System (ADS)
Vartanovich Mkrtychev, Oleg; Aleksandrovich Reshetov, Andrey
2017-11-01
Nowadays, structural mechanics exhibits a trend towards numerical solutions of increasingly extensive and detailed tasks, which requires that the capacity of computing systems be enhanced. Such enhancement can be achieved by different means. E.g., in case a computing system is represented by a workstation, its components can be replaced and/or extended (CPU, memory etc.). In essence, such modification eventually entails replacement of the entire workstation, i.e. replacement of certain components necessitates exchange of others (faster CPUs and memory devices require buses with higher throughput etc.). Special consideration must be given to the capabilities of modern video cards. They constitute powerful computing systems capable of running data processing in parallel. Interestingly, the tools originally designed to render high-performance graphics can be applied to solving problems not immediately related to graphics (CUDA, OpenCL, Shaders etc.). However, not all software suites utilize video cards’ capacities. Another way to increase the capacity of a computing system is to implement a cluster architecture: to add cluster nodes (workstations) and to increase the network communication speed between the nodes. The advantage of this approach is that it scales out: a quite powerful system can be obtained by combining nodes that are individually not particularly powerful. Moreover, separate nodes may possess different capacities. This paper considers the use of a clustered computing system for solving problems of structural mechanics with LS-DYNA software. A 2-node cluster proved sufficient to establish the range of dependencies of interest.
A method to compute SEU fault probabilities in memory arrays with error correction
NASA Technical Reports Server (NTRS)
Gercek, Gokhan
1994-01-01
With the increasing packing densities in VLSI technology, Single Event Upsets (SEU) due to cosmic radiation are becoming more of a critical issue in the design of space avionics systems. In this paper, a method is introduced to compute the fault (mishap) probability for a computer memory of size M words. It is assumed that a Hamming code is used for each word to provide single error correction. It is also assumed that every time a memory location is read, single errors are corrected. Memory locations are read at random, with a read distribution that is assumed to be known. In such a scenario, a mishap is defined as two SEUs corrupting the same memory location prior to a read. The paper introduces a method to compute the overall mishap probability for the entire memory for a mission duration of T hours.
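The paper's derivation is analytical; the sketch below is only a Monte Carlo cross-check of the same mishap definition, under the added assumption that per-word SEUs and reads arrive as independent Poisson processes, with artificially high illustrative rates:

```python
import numpy as np

def mishap_probability(M, T, lam_seu, lam_read, trials=1000, seed=0):
    """Estimate P(any word suffers two SEUs with no intervening read) over mission time T."""
    rng = np.random.default_rng(seed)
    mishaps = 0
    for _ in range(trials):
        failed = False
        for _word in range(M):
            n_seu = rng.poisson(lam_seu * T)
            if n_seu < 2:
                continue                      # a mishap needs at least two upsets
            n_read = rng.poisson(lam_read * T)
            events = sorted([(t, "seu") for t in rng.uniform(0, T, n_seu)] +
                            [(t, "read") for t in rng.uniform(0, T, n_read)])
            pending = False                   # True while an uncorrected upset sits in the word
            for _t, kind in events:
                if kind == "read":
                    pending = False           # read corrects the single pending error
                elif pending:
                    failed = True             # second SEU before a correcting read: mishap
                    break
                else:
                    pending = True
            if failed:
                break
        mishaps += failed
    return mishaps / trials

# Artificially high rates so the estimate is visibly nonzero (illustration only).
print(mishap_probability(M=16, T=1000.0, lam_seu=1e-3, lam_read=1e-2))
```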
Triple-server blind quantum computation using entanglement swapping
NASA Astrophysics Data System (ADS)
Li, Qin; Chan, Wai Hong; Wu, Chunhui; Wen, Zhonghua
2014-04-01
Blind quantum computation allows a client who does not have enough quantum resources or technologies to achieve quantum computation on a remote quantum server such that the client's input, output, and algorithm remain unknown to the server. Up to now, single- and double-server blind quantum computation have been considered. In this work, we propose a triple-server blind computation protocol where the client can delegate quantum computation to three quantum servers by the use of entanglement swapping. Furthermore, the three quantum servers can communicate with each other and the client is almost classical since one does not require any quantum computational power, quantum memory, and the ability to prepare any quantum states and only needs to be capable of getting access to quantum channels.
A Fresh Math Perspective Opens New Possibilities for Computational Chemistry
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vu, Linda; Govind, Niranjan; Yang, Chao
2017-05-26
By reformulating the TDDFT problem as a matrix function approximation, making use of a special transformation and taking advantage of the underlying symmetry with respect to a non-Euclidean metric, Yang and his colleagues were able to apply the Lanczos algorithm and a Kernel Polynomial Method (KPM) to approximate the absorption spectrum of several molecules. Both of these algorithms require relatively little memory compared to non-symmetric alternatives, which is the key to the computational savings.
The computational nature of memory modification.
Gershman, Samuel J; Monfils, Marie-H; Norman, Kenneth A; Niv, Yael
2017-03-15
Retrieving a memory can modify its influence on subsequent behavior. We develop a computational theory of memory modification, according to which modification of a memory trace occurs through classical associative learning, but which memory trace is eligible for modification depends on a structure learning mechanism that discovers the units of association by segmenting the stream of experience into statistically distinct clusters (latent causes). New memories are formed when the structure learning mechanism infers that a new latent cause underlies current sensory observations. By the same token, old memories are modified when old and new sensory observations are inferred to have been generated by the same latent cause. We derive this framework from probabilistic principles, and present a computational implementation. Simulations demonstrate that our model can reproduce the major experimental findings from studies of memory modification in the Pavlovian conditioning literature.
Benchmarking Memory Performance with the Data Cube Operator
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Shabanov, Leonid V.
2004-01-01
Data movement across a computer memory hierarchy and across computational grids is known to be a limiting factor for applications processing large data sets. We use the Data Cube Operator on an Arithmetic Data Set, called ADC, to benchmark the capabilities of computers and of computational grids to handle large distributed data sets. We present a prototype implementation of a parallel algorithm for computation of the operator. The algorithm follows a known approach for computing views from the smallest parent. The ADC stresses all levels of grid memory and storage by producing some of the 2^d views of an Arithmetic Data Set of d-tuples described by a small number of integers. We control the data intensity of the ADC by selecting the tuple parameters, the sizes of the views, and the number of realized views. Benchmarking results of memory performance of a number of computer architectures and of a small computational grid are presented.
Towards reversible basic linear algebra subprograms: A performance study
Perumalla, Kalyan S.; Yoginath, Srikanth B.
2014-12-06
Problems such as fault tolerance and scalable synchronization can be efficiently solved using reversibility of applications. Making applications reversible by relying on computation rather than on memory is ideal for large scale parallel computing, especially for the next generation of supercomputers in which memory is expensive in terms of latency, energy, and price. In this direction, a case study is presented here in reversing a computational core, namely, Basic Linear Algebra Subprograms, which is widely used in scientific applications. A new Reversible BLAS (RBLAS) library interface has been designed, and a prototype has been implemented with two modes: (1) a memory-mode in which reversibility is obtained by checkpointing to memory in forward and restoring from memory in reverse, and (2) a computational-mode in which nothing is saved in the forward, but restoration is done entirely via inverse computation in reverse. The article is focused on detailed performance benchmarking to evaluate the runtime dynamics and performance effects, comparing reversible computation with checkpointing on both traditional CPU platforms and recent GPU accelerator platforms. For BLAS Level-1 subprograms, data indicates over an order of magnitude better speed of reversible computation compared to checkpointing. For BLAS Level-2 and Level-3, a more complex tradeoff is observed between reversible computation and checkpointing, depending on computational and memory complexities of the subprograms.
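The two RBLAS modes can be contrasted on the simplest Level-1 kernel, AXPY; the function names below are hypothetical and are not the actual RBLAS interface:

```python
import numpy as np

def axpy_forward_compute_mode(a, x, y):
    y += a * x                    # nothing saved; reversal is by inverse computation

def axpy_reverse_compute_mode(a, x, y):
    y -= a * x                    # exact inverse of the forward update

def axpy_forward_memory_mode(a, x, y, tape):
    tape.append(y.copy())         # checkpoint the overwritten operand
    y += a * x

def axpy_reverse_memory_mode(y, tape):
    y[:] = tape.pop()             # restore from the checkpoint

x, y = np.arange(4.0), np.ones(4)
axpy_forward_compute_mode(2.0, x, y)
axpy_reverse_compute_mode(2.0, x, y)
assert np.allclose(y, 1.0)        # original y recovered without any storage
```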
k+-buffer: An Efficient, Memory-Friendly and Dynamic k-buffer Framework.
Vasilakis, Andreas-Alexandros; Papaioannou, Georgios; Fudos, Ioannis
2015-06-01
Depth-sorted fragment determination is fundamental for a host of image-based techniques which simulate complex rendering effects. It is also a challenging task in terms of the time and space required when rasterizing scenes with high depth complexity. When low graphics memory requirements are of utmost importance, the k-buffer is arguably the preferred framework, since it ensures the correct depth order on a subset of all generated fragments. Although various alternatives have been introduced to partially or completely alleviate the noticeable quality artifacts produced by the initial k-buffer algorithm, at the expense of increased memory or degraded performance, appropriate tools to automatically and dynamically compute the most suitable value of k are still missing. To this end, we introduce k+-buffer, a fast framework that accurately simulates the behavior of the k-buffer in a single rendering pass. Two memory-bounded data structures: (i) the max-array and (ii) the max-heap are developed on the GPU to concurrently maintain the k-foremost fragments per pixel by exploiting pixel synchronization and fragment culling. Memory-friendly strategies are further introduced to dynamically (a) lessen the wasteful memory allocation of individual pixels with low depth complexity frequencies, (b) minimize the allocated size of the k-buffer according to different application goals and hardware limitations via a straightforward depth histogram analysis and (c) manage the local GPU cache with a fixed-memory depth-sorting mechanism. Finally, an extensive experimental evaluation is provided demonstrating the advantages of our work over all prior k-buffer variants in terms of memory usage, performance cost and image quality.
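A sketch of the per-pixel max-heap idea (GPU details such as pixel synchronization and fragment culling are omitted): the heap root is the farthest of the kept fragments, so a new fragment is admitted only if it is nearer.

```python
import heapq

def insert_fragment(heap, k, depth, color):
    # store negated depth so heapq's min-heap behaves as a max-heap on depth
    if len(heap) < k:
        heapq.heappush(heap, (-depth, color))
    elif depth < -heap[0][0]:                 # nearer than the farthest kept fragment
        heapq.heapreplace(heap, (-depth, color))

heap, k = [], 3
for depth, color in [(0.9, "a"), (0.2, "b"), (0.5, "c"), (0.1, "d"), (0.7, "e")]:
    insert_fragment(heap, k, depth, color)
print(sorted((-d, c) for d, c in heap))       # k nearest fragments: depths 0.1, 0.2, 0.5
```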
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blanford, M.
1997-12-31
Most commercially-available quasistatic finite element programs assemble element stiffnesses into a global stiffness matrix, then use a direct linear equation solver to obtain nodal displacements. However, for large problems (greater than a few hundred thousand degrees of freedom), the memory size and computation time required for this approach become prohibitive. Moreover, direct solution does not lend itself to the parallel processing needed for today's multiprocessor systems. This talk gives an overview of the iterative solution strategy of JAS3D, the nonlinear large-deformation quasistatic finite element program. Because its architecture is derived from an explicit transient-dynamics code, it does not ever assemble a global stiffness matrix. The author describes the approach he used to implement the solver on multiprocessor computers, and shows examples of problems run on hundreds of processors and more than a million degrees of freedom. Finally, he describes some of the work he is presently doing to address the challenges of iterative convergence for ill-conditioned problems.
Optimization Issues with Complex Rotorcraft Comprehensive Analysis
NASA Technical Reports Server (NTRS)
Walsh, Joanne L.; Young, Katherine C.; Tarzanin, Frank J.; Hirsh, Joel E.; Young, Darrell K.
1998-01-01
This paper investigates the use of the general purpose automatic differentiation (AD) tool called Automatic Differentiation of FORTRAN (ADIFOR) as a means of generating sensitivity derivatives for use in Boeing Helicopter's proprietary comprehensive rotor analysis code (VII). ADIFOR transforms an existing computer program into a new program that performs a sensitivity analysis in addition to the original analysis. In this study both the pros (exact derivatives, no step-size problems) and cons (more CPU, more memory) of ADIFOR are discussed. The size (based on the number of lines) of the VII code after ADIFOR processing increased by 70 percent and resulted in substantial computer memory requirements at execution. The ADIFOR derivatives took about 75 percent longer to compute than the finite-difference derivatives. However, the ADIFOR derivatives are exact and are not functions of step-size. The VII sensitivity derivatives generated by ADIFOR are compared with finite-difference derivatives. The ADIFOR and finite-difference derivatives are used in three optimization schemes to solve a low vibration rotor design problem.
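The exact-versus-approximate contrast discussed above can be illustrated with a toy forward-mode dual-number class; this stands in for, and is not, ADIFOR's source transformation of a FORTRAN analysis code.

```python
# Toy forward-mode automatic differentiation: derivatives propagate exactly,
# while the finite-difference estimate depends on the chosen step size.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def f(x):                      # stands in for a response quantity of the analysis
    return 3.0 * x * x + 2.0 * x + 1.0

x0 = 1.5
ad = f(Dual(x0, 1.0)).dot                       # exact: 6*x0 + 2 = 11.0
h = 1e-6
fd = (f(x0 + h) - f(x0)) / h                    # step-size dependent approximation
print(ad, fd)
```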
A Linked List-Based Algorithm for Blob Detection on Embedded Vision-Based Sensors.
Acevedo-Avila, Ricardo; Gonzalez-Mendoza, Miguel; Garcia-Garcia, Andres
2016-05-28
Blob detection is a common task in vision-based applications. Most existing algorithms are aimed at execution on general purpose computers; while very few can be adapted to the computing restrictions present in embedded platforms. This paper focuses on the design of an algorithm capable of real-time blob detection that minimizes system memory consumption. The proposed algorithm detects objects in one image scan; it is based on a linked-list data structure tree used to label blobs depending on their shape and node information. An example application showing the results of a blob detection co-processor has been built on a low-powered field programmable gate array hardware as a step towards developing a smart video surveillance system. The detection method is intended for general purpose application. As such, several test cases focused on character recognition are also examined. The results obtained present a fair trade-off between accuracy and memory requirements; and prove the validity of the proposed approach for real-time implementation on resource-constrained computing platforms.
Challenges of Future High-End Computing
NASA Technical Reports Server (NTRS)
Bailey, David; Kutler, Paul (Technical Monitor)
1998-01-01
The next major milestone in high performance computing is a sustained rate of one Pflop/s (also written one petaflops, or 10^15 floating-point operations per second). In addition to prodigiously high computational performance, such systems must of necessity feature very large main memories, as well as comparably high I/O bandwidth and huge mass storage facilities. The current consensus of scientists who have studied these issues is that "affordable" petaflops systems may be feasible by the year 2010, assuming that certain key technologies continue to progress at current rates. One important question is whether applications can be structured to perform efficiently on such systems, which are expected to incorporate many thousands of processors and deeply hierarchical memory systems. To answer these questions, advanced performance modeling techniques, including simulation of future architectures and applications, may be required. It may also be necessary to formulate "latency tolerant algorithms" and other completely new algorithmic approaches for certain applications. This talk will give an overview of these challenges.
Parallel structures in human and computer memory
NASA Astrophysics Data System (ADS)
Kanerva, Pentti
1986-08-01
If we think of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library: We recognize a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. This paper is about how to construct a computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do. Such a memory is remarkably similar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. The paper concludes that the frame problem of artificial intelligence could be solved by the use of such a memory if we were able to encode information about the world properly.
Parallel structures in human and computer memory
NASA Technical Reports Server (NTRS)
Kanerva, P.
1986-01-01
If one thinks of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library. One recognizes a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. A computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do is constructed. Such a memory is remarkably similar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. It is concluded that the frame problem of artificial intelligence could be solved by the use of such a memory if one were able to encode information about the world properly.
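A toy sparse distributed memory in the spirit of the two records above; the sizes and activation radius are illustrative choices, not Kanerva's exact parameters.

```python
# Minimal sparse distributed memory: random hard addresses, activation of all
# locations within a Hamming radius, storage by counter accumulation, and
# retrieval by summing and thresholding. Retrieval works from a noisy cue.
import numpy as np

class SparseDistributedMemory:
    def __init__(self, n_bits=256, n_locations=2000, radius=111, seed=0):
        rng = np.random.default_rng(seed)
        self.addresses = rng.integers(0, 2, (n_locations, n_bits), dtype=np.int8)
        self.counters = np.zeros((n_locations, n_bits), dtype=np.int32)
        self.radius = radius

    def _active(self, address):
        return np.count_nonzero(self.addresses != address, axis=1) <= self.radius

    def write(self, address, data):
        self.counters[self._active(address)] += 2 * data.astype(np.int32) - 1

    def read(self, address):
        return (self.counters[self._active(address)].sum(axis=0) > 0).astype(np.int8)

rng = np.random.default_rng(1)
sdm = SparseDistributedMemory()
word = rng.integers(0, 2, 256, dtype=np.int8)
sdm.write(word, word)                               # auto-associative storage
noisy = word.copy()
noisy[rng.choice(256, 20, replace=False)] ^= 1      # 20-bit-noisy retrieval cue
print(np.count_nonzero(sdm.read(noisy) != word))    # typically few or no bit errors
```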
Memory-efficient RNA energy landscape exploration
Mann, Martin; Kucharík, Marcel; Flamm, Christoph; Wolfinger, Michael T.
2014-01-01
Motivation: Energy landscapes provide a valuable means for studying the folding dynamics of short RNA molecules in detail by modeling all possible structures and their transitions. Higher abstraction levels based on a macro-state decomposition of the landscape enable the study of larger systems; however, they are still restricted by huge memory requirements of exact approaches. Results: We present a highly parallelizable local enumeration scheme that enables the computation of exact macro-state transition models with highly reduced memory requirements. The approach is evaluated on RNA secondary structure landscapes using a gradient basin definition for macro-states. Furthermore, we demonstrate the need for exact transition models by comparing two barrier-based approaches, and perform a detailed investigation of gradient basins in RNA energy landscapes. Availability and implementation: Source code is part of the C++ Energy Landscape Library available at http://www.bioinf.uni-freiburg.de/Software/. Contact: mmann@informatik.uni-freiburg.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24833804
shinyheatmap: Ultra fast low memory heatmap web interface for big data genomics.
Khomtchouk, Bohdan B; Hennessy, James R; Wahlestedt, Claes
2017-01-01
Transcriptomics, metabolomics, metagenomics, and other various next-generation sequencing (-omics) fields are known for their production of large datasets, especially across single-cell sequencing studies. Visualizing such big data has posed technical challenges in biology, both in terms of available computational resources as well as programming acumen. Since heatmaps are used to depict high-dimensional numerical data as a colored grid of cells, efficiency and speed have often proven to be critical considerations in the process of successfully converting data into graphics. For example, rendering interactive heatmaps from large input datasets (e.g., 100k+ rows) has been computationally infeasible on both desktop computers and web browsers. In addition to memory requirements, programming skills and knowledge have frequently been barriers-to-entry for creating highly customizable heatmaps. We propose shinyheatmap: an advanced user-friendly heatmap software suite capable of efficiently creating highly customizable static and interactive biological heatmaps in a web browser. shinyheatmap is a low memory footprint program, making it particularly well-suited for the interactive visualization of extremely large datasets that cannot typically be computed in-memory due to size restrictions. Also, shinyheatmap features a built-in high performance web plug-in, fastheatmap, for rapidly plotting interactive heatmaps of datasets as large as 10^5-10^7 rows within seconds, effectively shattering previous performance benchmarks of heatmap rendering speed. shinyheatmap is hosted online as a freely available web server with an intuitive graphical user interface: http://shinyheatmap.com. The methods are implemented in R, and are available as part of the shinyheatmap project at: https://github.com/Bohdan-Khomtchouk/shinyheatmap. Users can access fastheatmap directly from within the shinyheatmap web interface, and all source code has been made publicly available on Github: https://github.com/Bohdan-Khomtchouk/fastheatmap.
To Switch or Not to Switch: Role of Cognitive Control in Working Memory Training in Older Adults.
Basak, Chandramallika; O'Connell, Margaret A
2016-01-01
It is currently not known what the best working memory training strategies are to offset the age-related declines in fluid cognitive abilities. In this randomized clinical double-blind trial, older adults were randomly assigned to one of two types of working memory training - one group was trained on a predictable memory updating task (PT) and another group was trained on a novel, unpredictable memory updating task (UT). Unpredictable memory updating, compared to predictable, places greater demands on cognitive control (Basak and Verhaeghen, 2011a). Therefore, the current study allowed us to evaluate the role of cognitive control in working memory training. All participants were assessed on a set of near and far transfer tasks at three different testing sessions - before training, immediately after the training, and 1.5 months after completing the training. Additionally, individual learning rates for a comparison working memory task (performed by both groups) and the trained task were computed. Training on unpredictable memory updating, compared to predictable, significantly enhanced performance on a measure of episodic memory, immediately after the training. Moreover, individuals with faster learning rates showed greater gains in this episodic memory task and another new working memory task; this effect was specific to UT. We propose that unpredictable memory updating training, compared to predictable memory updating training, may be a better strategy to improve selective cognitive abilities in older adults, and future studies could further investigate the role of cognitive control in working memory training.
Load Balancing Strategies for Multiphase Flows on Structured Grids
NASA Astrophysics Data System (ADS)
Olshefski, Kristopher; Owkes, Mark
2017-11-01
The computation time required to perform large simulations of complex systems is currently one of the leading bottlenecks of computational research. Parallelization allows multiple processing cores to perform calculations simultaneously and reduces computational times. However, load imbalances between processors waste computing resources as processors wait for others to complete imbalanced tasks. In multiphase flows, these imbalances arise due to the additional computational effort required at the gas-liquid interface. However, many current load balancing schemes are only designed for unstructured grid applications. The purpose of this research is to develop a load balancing strategy while maintaining the simplicity of a structured grid. Several approaches are investigated including brute force oversubscription, node oversubscription through Message Passing Interface (MPI) commands, and shared memory load balancing using OpenMP. Each of these strategies are tested with a simple one-dimensional model prior to implementation into the three-dimensional NGA code. Current results show load balancing will reduce computational time by at least 30%.
Flash drive memory apparatus and method
NASA Technical Reports Server (NTRS)
Hinchey, Michael G. (Inventor)
2010-01-01
A memory apparatus includes a non-volatile computer memory, a USB mass storage controller connected to the non-volatile computer memory, the USB mass storage controller including a daisy chain component, a male USB interface connected to the USB mass storage controller, and at least one other interface for a memory device, other than a USB interface, the at least one other interface being connected to the USB mass storage controller.
The computational nature of memory modification
Gershman, Samuel J; Monfils, Marie-H; Norman, Kenneth A; Niv, Yael
2017-01-01
Retrieving a memory can modify its influence on subsequent behavior. We develop a computational theory of memory modification, according to which modification of a memory trace occurs through classical associative learning, but which memory trace is eligible for modification depends on a structure learning mechanism that discovers the units of association by segmenting the stream of experience into statistically distinct clusters (latent causes). New memories are formed when the structure learning mechanism infers that a new latent cause underlies current sensory observations. By the same token, old memories are modified when old and new sensory observations are inferred to have been generated by the same latent cause. We derive this framework from probabilistic principles, and present a computational implementation. Simulations demonstrate that our model can reproduce the major experimental findings from studies of memory modification in the Pavlovian conditioning literature. DOI: http://dx.doi.org/10.7554/eLife.23763.001 PMID:28294944
Method of up-front load balancing for local memory parallel processors
NASA Technical Reports Server (NTRS)
Baffes, Paul Thomas (Inventor)
1990-01-01
In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balanced load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of sixty to seventy-five percent.
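The general idea of folding many process sets down onto a few processors while keeping the aggregate load even can be sketched with a simple greedy heuristic; this is illustrative only and is not the patent's partition-threshold merging procedure.

```python
import heapq

def fold_process_sets(set_loads, n_processors):
    """Assign each process set to the currently lightest processor (largest sets first)."""
    bins = [(0.0, p, []) for p in range(n_processors)]   # (load, processor id, members)
    heapq.heapify(bins)
    for s, load in sorted(enumerate(set_loads), key=lambda kv: -kv[1]):
        total, p, members = heapq.heappop(bins)          # lightest processor so far
        members.append(s)
        heapq.heappush(bins, (total + load, p, members))
    return sorted(bins, key=lambda b: b[1])

for load, p, members in fold_process_sets([5, 3, 8, 2, 7, 4, 6, 1], n_processors=3):
    print(f"processor {p}: sets {members}, load {load}")
```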
Conditional load and store in a shared memory
Blumrich, Matthias A; Ohmacht, Martin
2015-02-03
A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
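A behavioral sketch of the reservation-register mechanism described above (not the hardware implementation): a store-conditional succeeds only while the issuing processor's reservation on the address is intact, and a successful store invalidates competing reservations.

```python
class SharedCache:
    def __init__(self, n_processors):
        self.memory = {}
        self.reservations = [None] * n_processors     # one reservation register per processor

    def load_reserve(self, proc, addr):
        self.reservations[proc] = addr
        return self.memory.get(addr, 0)

    def store_conditional(self, proc, addr, value):
        if self.reservations[proc] != addr:
            return False                              # reservation lost: store fails
        self.memory[addr] = value
        for p in range(len(self.reservations)):       # invalidate competing reservations
            if p != proc and self.reservations[p] == addr:
                self.reservations[p] = None
        self.reservations[proc] = None
        return True

cache = SharedCache(n_processors=2)
v0 = cache.load_reserve(0, 0x40)
v1 = cache.load_reserve(1, 0x40)
print(cache.store_conditional(0, 0x40, v0 + 1))       # True: processor 0 wins
print(cache.store_conditional(1, 0x40, v1 + 1))       # False: processor 1 must retry
```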
McCormick, Cornelia; Protzner, Andrea B.; Barnett, Alexander J.; Cohn, Melanie; Valiante, Taufik A.; McAndrews, Mary Pat
2014-01-01
Computational models predict that focal damage to the Default Mode Network (DMN) causes widespread decreases and increases of functional DMN connectivity. How such alterations impact functioning in a specific cognitive domain such as episodic memory remains relatively unexplored. Here, we show in patients with unilateral medial temporal lobe epilepsy (mTLE) that focal structural damage leads indeed to specific patterns of DMN functional connectivity alterations, specifically decreased connectivity between both medial temporal lobes (MTLs) and the posterior part of the DMN and increased intrahemispheric anterior–posterior connectivity. Importantly, these patterns were associated with better and worse episodic memory capacity, respectively. These distinct patterns, shown here for the first time, suggest that a close dialogue between both MTLs and the posterior components of the DMN is required to fully express the extensive repertoire of episodic memory abilities. PMID:25068108
Individual differences in working memory capacity and workload capacity.
Yu, Ju-Chi; Chang, Ting-Yun; Yang, Cheng-Ta
2014-01-01
We investigated the relationship between working memory capacity (WMC) and workload capacity (WLC). Each participant performed an operation span (OSPAN) task to measure his/her WMC and three redundant-target detection tasks to measure his/her WLC. WLC was computed non-parametrically (Experiments 1 and 2) and parametrically (Experiment 2). Both levels of analyses showed that participants high in WMC had larger WLC than those low in WMC only when redundant information came from visual and auditory modalities, suggesting that high-WMC participants had superior processing capacity in dealing with redundant visual and auditory information. This difference was eliminated when multiple processes required processing for only a single working memory subsystem in a color-shape detection task and a double-dot detection task. These results highlighted the role of executive control in integrating and binding information from the two working memory subsystems for perceptual decision making.
Reed Solomon codes for error control in byte organized computer memory systems
NASA Technical Reports Server (NTRS)
Lin, S.; Costello, D. J., Jr.
1984-01-01
A problem in designing semiconductor memories is to provide some measure of error control without requiring excessive coding overhead or decoding time. In LSI and VLSI technology, memories are often organized on a multiple bit (or byte) per chip basis. For example, some 256K-bit DRAM's are organized in 32Kx8 bit-bytes. Byte oriented codes such as Reed Solomon (RS) codes can provide efficient low overhead error control for such memories. However, the standard iterative algorithm for decoding RS codes is too slow for these applications. Some special decoding techniques for extended single- and double-error-correcting RS codes which are capable of high speed operation are presented. These techniques are designed to find the error locations and the error values directly from the syndrome without having to use the iterative algorithm to find the error locator polynomial.
Fast associative memory + slow neural circuitry = the computational model of the brain.
NASA Astrophysics Data System (ADS)
Berkovich, Simon; Berkovich, Efraim; Lapir, Gennady
1997-08-01
We propose a computational model of the brain based on a fast associative memory and relatively slow neural processors. In this model, processing time is expensive but memory access is not, and therefore most algorithmic tasks would be accomplished by using large look-up tables as opposed to calculating. The essential feature of an associative memory in this context (characteristic of a holographic type memory) is that it works without an explicit mechanism for resolution of multiple responses. As a result, the slow neuronal processing elements, overwhelmed by the flow of information, operate as a set of templates for ranking of the retrieved information. This structure addresses the primary controversy in the brain architecture: distributed organization of memory vs. localization of processing centers. This computational model offers an intriguing explanation of many of the paradoxical features of the brain architecture, such as integration of sensors (through a DMA mechanism), subliminal perception, universality of software, interrupts, fault-tolerance, certain bizarre possibilities for rapid arithmetic, etc. In conventional computer science the presented type of computational model has not attracted attention, as it goes against the technological grain by using a working memory faster than the processing elements.
Hopfield, J J
2008-05-01
The algorithms that simple feedback neural circuits representing a brain area can rapidly carry out are often adequate to solve easy problems but for more difficult problems can return incorrect answers. A new excitatory-inhibitory circuit model of associative memory displays the common human problem of failing to rapidly find a memory when only a small clue is present. The memory model and a related computational network for solving Sudoku puzzles produce answers that contain implicit check bits in the representation of information across neurons, allowing a rapid evaluation of whether the putative answer is correct or incorrect through a computation related to visual pop-out. This fact may account for our strong psychological feeling of right or wrong when we retrieve a nominal memory from a minimal clue. This information allows more difficult computations or memory retrievals to be done in a serial fashion by using the fast but limited capabilities of a computational module multiple times. The mathematics of the excitatory-inhibitory circuits for associative memory and for Sudoku, both of which are understood in terms of energy or Lyapunov functions, is described in detail.
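A toy energy-based associative memory in the same spirit (Hebbian storage, asynchronous updates, non-increasing energy); it illustrates retrieval from a partial clue, not Hopfield's specific excitatory-inhibitory or Sudoku circuits.

```python
import numpy as np

def train(patterns):
    n = patterns.shape[1]
    W = (patterns.T @ patterns).astype(float) / n     # Hebbian outer-product rule
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, s):
    return -0.5 * s @ W @ s

def recall(W, state, sweeps=10, seed=0):
    rng, s = np.random.default_rng(seed), state.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(s)):             # asynchronous single-neuron updates
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

rng = np.random.default_rng(3)
patterns = rng.choice([-1, 1], size=(3, 64))          # three stored bipolar memories
W = train(patterns)
clue = patterns[0].copy()
clue[:8] *= -1                                        # corrupt 8 of 64 bits of the clue
out = recall(W, clue)
print(np.array_equal(out, patterns[0]),               # typically True: memory recovered
      energy(W, out) <= energy(W, clue))              # energy never increases
```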
Neural Representations of Location Outside the Hippocampus
ERIC Educational Resources Information Center
Knierim, James J.
2006-01-01
Place cells of the rat hippocampus are a dominant model system for understanding the role of the hippocampus in learning and memory at the level of single-unit and neural ensemble responses. A complete understanding of the information processing and computations performed by the hippocampus requires detailed knowledge about the properties of the…
ERIC Educational Resources Information Center
Durrett, John; Trezona, Judi
1982-01-01
Discusses physiological and psychological aspects of color. Includes guidelines for using color effectively, especially in the development of computer programs. Indicates that if applied with its limitations and requirements in mind, color can be a powerful manipulator of attention, memory, and understanding. (Author/JN)
High-Dimensional Semantic Space Accounts of Priming
ERIC Educational Resources Information Center
Jones, Michael N.; Kintsch, Walter; Mewhort, Douglas J. K.
2006-01-01
A broad range of priming data has been used to explore the structure of semantic memory and to test between models of word representation. In this paper, we examine the computational mechanisms required to learn distributed semantic representations for words directly from unsupervised experience with language. To best account for the variety of…
Multiprocessing MCNP on an IBM RS/6000 cluster
DOE Office of Scientific and Technical Information (OSTI.GOV)
McKinney, G.W.; West, J.T.
1993-01-01
The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization. Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors P and the fraction f of task time that multiprocesses, can be formulated using Amdahl's law: S(f, P) = 1 / ((1 - f) + f/P). However, for most applications, this theoretical limit cannot be achieved because of additional terms (e.g., multitasking overhead, memory overlap, etc.) that are not included in Amdahl's law. Monte Carlo transport is a natural candidate for multiprocessing because the particle tracks are generally independent, and the precision of the result increases as the square root of the number of particles tracked.
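The quoted Amdahl's-law expression is easy to evaluate directly; the values of f and P below are arbitrary examples.

```python
def amdahl_speedup(f, P):
    """Speedup for a fraction f of parallelizable work on P processors."""
    return 1.0 / ((1.0 - f) + f / P)

print(amdahl_speedup(0.95, 8))     # ~5.9x on 8 processors
print(amdahl_speedup(0.95, 1000))  # ~19.6x: the serial 5% dominates
```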
Memory conformity affects inaccurate memories more than accurate memories.
Wright, Daniel B; Villalba, Daniella K
2012-01-01
After controlling for initial confidence, inaccurate memories were shown to be more easily distorted than accurate memories. In two experiments groups of participants viewed 50 stimuli and were then presented with these stimuli plus 50 fillers. During this test phase participants reported their confidence that each stimulus was originally shown. This was followed by computer-generated responses from a bogus participant. After being exposed to this response participants again rated the confidence of their memory. The computer-generated responses systematically distorted participants' responses. Memory distortion depended on initial memory confidence, with uncertain memories being more malleable than confident memories. This effect was moderated by whether the participant's memory was initially accurate or inaccurate. Inaccurate memories were more malleable than accurate memories. The data were consistent with a model describing two types of memory (i.e., recollective and non-recollective memories), which differ in how susceptible these memories are to memory distortion.
Computers, the Human Mind, and My In-Laws' House.
ERIC Educational Resources Information Center
Esque, Timm J.
1996-01-01
Discussion of human memory, computer memory, and the storage of information focuses on a metaphor that can account for memory without storage and can set the stage for systemic research around a more comprehensive, understandable theory. (Author/LRW)
Xia, Fei; Jin, Guoqing
2014-06-01
PKNOTS is a well-known benchmark program that has been widely used to predict RNA secondary structure including pseudoknots. It adopts the standard four-dimensional (4D) dynamic programming (DP) method and is the basis of many variants and improved algorithms. Unfortunately, the O(N^6) computing requirements and complicated data dependencies greatly limit the usefulness of the PKNOTS package with the explosion in gene database size. In this paper, we present a fine-grained parallel PKNOTS package and prototype system for accelerating the RNA folding application based on an FPGA chip. We adopted a series of storage optimization strategies to resolve the "Memory Wall" problem. We aggressively exploit parallel computing strategies to improve computational efficiency. We also propose several methods that collectively reduce the storage requirements for FPGA on-chip memory. To the best of our knowledge, our design is the first FPGA implementation for accelerating the 4D DP problem for RNA folding including pseudoknots. The experimental results show a factor of more than 50x average speedup over the PKNOTS-1.08 software running on a PC platform with an Intel Core2 Q9400 Quad CPU for input RNA sequences. However, the power consumption of our FPGA accelerator is only about 50% of that of general-purpose micro-processors.
NASA Technical Reports Server (NTRS)
Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas
2008-01-01
A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
Cheaper Adjoints by Reversing Address Computations
Hascoët, L.; Utke, J.; Naumann, U.
2008-01-01
The reverse mode of automatic differentiation is widely used in science and engineering. A severe bottleneck for the performance of the reverse mode, however, is the necessity to recover certain intermediate values of the program in reverse order. Among these values are computed addresses, which traditionally are recovered through forward recomputation and storage in memory. We propose an alternative approach for recovery that uses inverse computation based on dependency information. Address storage constitutes a significant portion of the overall storage requirements. An example illustrates substantial gains that the proposed approach yields, and we show use cases in practical applications.
NASA Technical Reports Server (NTRS)
Muellerschoen, R. J.
1988-01-01
A unified method to permute vector-stored Upper triangular Diagonal factorized covariance arrays and vector-stored upper triangular Square Root Information arrays is presented. The method involves cyclic permutation of the rows and columns of the arrays and retriangularization with fast (slow) Givens rotations (reflections). Minimal computation is performed, and a one-dimensional scratch array is required. To make the method efficient for large arrays on a virtual memory machine, computations are arranged so as to avoid expensive paging faults. This method is potentially important for processing large volumes of radio metric data in the Deep Space Network.
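The effect being computed can be sketched as follows: reordering the estimated parameters permutes the columns of the upper triangular factor, which must then be re-triangularized. The sketch uses a dense QR call for clarity; the paper's contribution is doing this cheaply with cyclically applied Givens rotations and only a one-dimensional scratch array.

```python
import numpy as np

def permute_sri_factor(R, perm):
    Rp = R[:, perm]                      # column permutation = parameter reordering
    return np.linalg.qr(Rp, mode="r")    # restore upper triangular form

rng = np.random.default_rng(0)
A = rng.random((5, 5))
R = np.linalg.qr(A, mode="r")            # some upper triangular information factor
perm = [2, 0, 1, 4, 3]                   # a cyclic-style reordering of parameters
Rnew = permute_sri_factor(R, perm)
# The information matrix is preserved under the same permutation:
print(np.allclose(Rnew.T @ Rnew, (R[:, perm]).T @ R[:, perm]))
```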
A Component-Based FPGA Design Framework for Neuronal Ion Channel Dynamics Simulations
Mak, Terrence S. T.; Rachmuth, Guy; Lam, Kai-Pui; Poon, Chi-Sang
2008-01-01
Neuron-machine interfaces such as dynamic clamp and brain-implantable neuroprosthetic devices require real-time simulations of neuronal ion channel dynamics. Field Programmable Gate Array (FPGA) has emerged as a high-speed digital platform ideal for such application-specific computations. We propose an efficient and flexible component-based FPGA design framework for neuronal ion channel dynamics simulations, which overcomes certain limitations of the recently proposed memory-based approach. A parallel processing strategy is used to minimize computational delay, and a hardware-efficient factoring approach for calculating exponential and division functions in neuronal ion channel models is used to conserve resource consumption. Performances of the various FPGA design approaches are compared theoretically and experimentally in corresponding implementations of the AMPA and NMDA synaptic ion channel models. Our results suggest that the component-based design framework provides a more memory economic solution as well as more efficient logic utilization for large word lengths, whereas the memory-based approach may be suitable for time-critical applications where a higher throughput rate is desired. PMID:17190033
Interpolation Approach To Computer-Generated Holograms
NASA Astrophysics Data System (ADS)
Yatagai, Toyohiko
1983-10-01
A computer-generated hologram (CGH) for reconstructing independent NxN resolution points would actually require a hologram made up of NxN sampling cells. For dependent sampling points of Fourier transform CGHs, the required memory size for computation by using an interpolation technique for reconstructed image points can be reduced. We have made a mosaic hologram which consists of K x K subholograms with N x N sampling points multiplied by an appropriate weighting factor. It is shown that the mosaic hologram can reconstruct an image with NK x NK resolution points. The main advantage of the present algorithm is that a sufficiently large size hologram of NK x NK sample points is synthesized by K x K subholograms which are successively calculated from the data of N x N sample points and also successively plotted.
Exploring the use of I/O nodes for computation in a MIMD multiprocessor
NASA Technical Reports Server (NTRS)
Kotz, David; Cai, Ting
1995-01-01
As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.
Fluid/Structure Interaction Studies of Aircraft Using High Fidelity Equations on Parallel Computers
NASA Technical Reports Server (NTRS)
Guruswamy, Guru; VanDalsem, William (Technical Monitor)
1994-01-01
Aeroelasticity, which involves strong coupling of fluids, structures and controls, is an important element in designing an aircraft. Computational aeroelasticity using low fidelity methods, such as the linear aerodynamic flow equations coupled with the modal structural equations, is well advanced. Though these low fidelity approaches are computationally less intensive, they are not adequate for the analysis of modern aircraft such as the High Speed Civil Transport (HSCT) and Advanced Subsonic Transport (AST), which can experience complex flow/structure interactions. HSCT can experience vortex-induced aeroelastic oscillations, whereas AST can experience transonic buffet associated structural oscillations. Both aircraft may experience a dip in the flutter speed at the transonic regime. For accurate aeroelastic computations in these complex fluid/structure interaction situations, high fidelity equations such as the Navier-Stokes for fluids and the finite elements for structures are needed. Computations using these high fidelity equations require large computational resources both in memory and speed. Current conventional supercomputers have reached their limitations both in memory and speed. As a result, parallel computers have evolved to overcome the limitations of conventional computers. This paper will address the transition that is taking place in computational aeroelasticity from conventional computers to parallel computers. The paper will address special techniques needed to take advantage of the architecture of new parallel computers. Results will be illustrated from computations made on the iPSC/860 and IBM SP2 computers using the ENSAERO code that directly couples the Euler/Navier-Stokes flow equations with high resolution finite-element structural equations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hull, L.C.
The Prickett and Lonnquist two-dimensional groundwater model has been programmed for the Apple II minicomputer. Both leaky and nonleaky confined aquifers can be simulated. The model was adapted from the FORTRAN version of Prickett and Lonnquist. In the configuration presented here, the program requires 64 K bits of memory. Because of the large number of arrays used in the program, and the memory limitations of the Apple II, the maximum grid size that can be used is 20 rows by 20 columns. Input to the program is interactive, with prompting by the computer. Output consists of predicted head values at the row-column intersections (nodes).
Turbulence modeling of free shear layers for high performance aircraft
NASA Technical Reports Server (NTRS)
Sondak, Douglas
1993-01-01
In many flowfield computations, accuracy of the turbulence model employed is frequently a limiting factor in the overall accuracy of the computation. This is particularly true for complex flowfields such as those around full aircraft configurations. Free shear layers such as wakes, impinging jets (in V/STOL applications), and mixing layers over cavities are often part of these flowfields. Although flowfields have been computed for full aircraft, the memory and CPU requirements for these computations are often excessive. Additional computer power is required for multidisciplinary computations such as coupled fluid dynamics and conduction heat transfer analysis. Massively parallel computers show promise in alleviating this situation, and the purpose of this effort was to adapt and optimize CFD codes to these new machines. The objective of this research effort was to compute the flowfield and heat transfer for a two-dimensional jet impinging normally on a cool plate. The results of this research effort were summarized in an AIAA paper titled 'Parallel Implementation of the k-epsilon Turbulence Model'. Appendix A contains the full paper.
Parallel Simulation of Unsteady Turbulent Flames
NASA Technical Reports Server (NTRS)
Menon, Suresh
1996-01-01
Time-accurate simulation of turbulent flames in high Reynolds number flows is a challenging task since both fluid dynamics and combustion must be modeled accurately. To numerically simulate this phenomenon, very large computer resources (both time and memory) are required. Although current vector supercomputers are capable of providing adequate resources for simulations of this nature, their high cost and limited availability make practical use of such machines less than satisfactory. At the same time, the explicit time integration algorithms used in unsteady flow simulations often possess a very high degree of parallelism, making them very amenable to efficient implementation on large-scale parallel computers. Under these circumstances, distributed memory parallel computers offer an excellent near-term solution for greatly increased computational speed and memory, at a cost that may render the unsteady simulations of the type discussed above more feasible and affordable. This paper discusses the study of unsteady turbulent flames using a simulation algorithm that is capable of retaining high parallel efficiency on distributed memory parallel architectures. Numerical studies are carried out using large-eddy simulation (LES). In LES, the scales larger than the grid are computed using a time- and space-accurate scheme, while the unresolved small scales are modeled using eddy viscosity based subgrid models. This is acceptable for the moment/energy closure since the small scales primarily provide a dissipative mechanism for the energy transferred from the large scales. However, for combustion to occur, the species must first undergo mixing at the small scales and then come into molecular contact. Therefore, global models cannot be used. Recently, a new model for turbulent combustion was developed, in which the combustion is modeled within the subgrid (small scales) using a methodology that simulates the mixing, the molecular transport and the chemical kinetics within each LES grid cell. Finite-rate kinetics can be included without any closure, and this approach actually provides a means to predict the turbulent rates and the turbulent flame speed. The subgrid combustion model requires resolution of the local time scales associated with small-scale mixing, molecular diffusion and chemical kinetics and, therefore, within each grid cell, a significant amount of computation must be carried out before the large-scale (LES resolved) effects are incorporated. Therefore, this approach is uniquely suited for parallel processing and has been implemented on various systems such as the Intel Paragon, IBM SP-2, Cray T3D and SGI Power Challenge (PC) using the system-independent Message Passing Interface (MPI) compiler. In this paper, timing data on these machines is reported along with some characteristic results.
NASA Astrophysics Data System (ADS)
Luo, Ye; Esler, Kenneth; Kent, Paul; Shulenburger, Luke
Quantum Monte Carlo (QMC) calculations of giant molecules and of surface and defect properties of solids have recently become feasible due to drastically expanding computational resources. However, with the most computationally efficient basis set, B-splines, these calculations are severely restricted by the memory capacity of compute nodes. The B-spline coefficients are shared on a node but not distributed among nodes, to ensure fast evaluation. A hybrid representation which incorporates atomic orbitals near the ions and B-spline ones in the interstitial regions offers a more accurate and less memory demanding description of the orbitals, because they are naturally more atomic like near ions and much smoother in between, thus allowing coarser B-spline grids. We will demonstrate the advantage of the hybrid representation over pure B-spline and Gaussian basis sets and also show significant speed-ups, for example in computing the non-local pseudopotentials with our new scheme. Moreover, we discuss a new algorithm for atomic orbital initialization, which used to require an extra workflow step taking a few days. With this work, the highly efficient hybrid representation paves the way to simulating large, even inhomogeneous, systems using QMC. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Computational Materials Sciences Program.
Limited-memory adaptive snapshot selection for proper orthogonal decomposition
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oxberry, Geoffrey M.; Kostova-Vassilevska, Tanya; Arrighi, Bill
2015-04-02
Reduced order models are useful for accelerating simulations in many-query contexts, such as optimization, uncertainty quantification, and sensitivity analysis. However, offline training of reduced order models can have prohibitively expensive memory and floating-point operation costs in high-performance computing applications, where memory per core is limited. To overcome this limitation for proper orthogonal decomposition, we propose a novel adaptive selection method for snapshots in time that limits offline training costs by selecting snapshots according to an error control mechanism similar to that found in adaptive time-stepping ordinary differential equation solvers. The error estimator used in this work is related to theory bounding the approximation error in time of proper orthogonal decomposition-based reduced order models, and memory usage is minimized by computing the singular value decomposition using a single-pass incremental algorithm. Results for a viscous Burgers' test problem demonstrate convergence in the limit as the algorithm error tolerances go to zero; in this limit, the full order model is recovered to within discretization error. The resulting method can be used on supercomputers to generate proper orthogonal decomposition-based reduced order models, or as a subroutine within hyperreduction algorithms that require taking snapshots in time, or within greedy algorithms for sampling parameter space.
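A sketch of the error-controlled selection idea, with the basis recomputed by a thin SVD for clarity rather than the paper's single-pass incremental algorithm; the tolerance and the synthetic trajectory are arbitrary.

```python
import numpy as np

def select_snapshots(snapshots, tol=1e-2):
    """Keep a snapshot only if its relative projection error onto the current basis exceeds tol."""
    kept, basis = [], None
    for s in snapshots:
        if basis is None:
            err = 1.0
        else:
            proj = basis @ (basis.T @ s)
            err = np.linalg.norm(s - proj) / max(np.linalg.norm(s), 1e-300)
        if err > tol:                     # snapshot not captured well enough: keep it
            kept.append(s)
            basis, _, _ = np.linalg.svd(np.column_stack(kept), full_matrices=False)
    return kept, basis

# Synthetic trajectory that slowly rotates between two fixed modes.
t = np.linspace(0, 1, 200)
m1, m2 = np.sin(np.linspace(0, np.pi, 50)), np.cos(np.linspace(0, np.pi, 50))
snaps = [np.cos(a) * m1 + np.sin(a) * m2 for a in t]
kept, basis = select_snapshots(snaps, tol=1e-2)
print(len(kept), "of", len(snaps), "snapshots retained")
```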
An efficient photogrammetric stereo matching method for high-resolution images
NASA Astrophysics Data System (ADS)
Li, Yingsong; Zheng, Shunyi; Wang, Xiaonan; Ma, Hao
2016-12-01
Stereo matching of high-resolution images is a great challenge in photogrammetry. The main difficulty is the enormous processing workload that involves substantial computing time and memory consumption. In recent years, the semi-global matching (SGM) method has been a promising approach for solving stereo problems on different data sets. However, the time complexity and memory demand of SGM are proportional to the scale of the images involved, which leads to very high consumption when dealing with large images. To address this, the paper presents an efficient hierarchical matching strategy based on the SGM algorithm using single instruction, multiple data (SIMD) instructions and structured parallelism in the central processing unit. The proposed method can significantly reduce the computational time and memory required for large-scale stereo matching. The three-dimensional (3D) surface is reconstructed by triangulating and fusing redundant reconstruction information from multi-view matching results. Finally, three high-resolution aerial data sets are used to evaluate our improvement. Furthermore, precise airborne laser scanner data for one data set are used to measure the accuracy of our reconstruction. Experimental results demonstrate that our method delivers remarkable time and memory savings while maintaining the density and precision of the derived 3D point cloud.
GPU and APU computations of Finite Time Lyapunov Exponent fields
NASA Astrophysics Data System (ADS)
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
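For readers unfamiliar with the FTLE pipeline, the NumPy sketch below shows the serial version of the computation the paper accelerates on GPUs/APUs: advect a dense grid of tracers, differentiate the resulting flow map, and take the largest eigenvalue of the Cauchy-Green tensor. The double-gyre velocity field, the resolutions and the simple forward-Euler integrator are illustrative choices, not the paper's setup.

```python
# Serial FTLE computation on a 2-D analytic (double-gyre) flow.
import numpy as np

def velocity(x, y, t, A=0.1, eps=0.25, om=2 * np.pi / 10):
    a = eps * np.sin(om * t)
    b = 1.0 - 2.0 * a
    f = a * x**2 + b * x
    dfdx = 2.0 * a * x + b
    u = -np.pi * A * np.sin(np.pi * f) * np.cos(np.pi * y)
    v = np.pi * A * np.cos(np.pi * f) * np.sin(np.pi * y) * dfdx
    return u, v

def ftle(nx=200, ny=100, T=10.0, dt=0.05):
    x, y = np.meshgrid(np.linspace(0, 2, nx), np.linspace(0, 1, ny))
    px, py = x.copy(), y.copy()
    for t in np.arange(0.0, T, dt):          # advect the tracer grid
        u, v = velocity(px, py, t)
        px, py = px + dt * u, py + dt * v
    # Gradient of the flow map via central differences on the initial grid.
    dxdX = np.gradient(px, x[0, :], axis=1); dxdY = np.gradient(px, y[:, 0], axis=0)
    dydX = np.gradient(py, x[0, :], axis=1); dydY = np.gradient(py, y[:, 0], axis=0)
    # Largest eigenvalue of C = F^T F for the 2x2 deformation gradient F.
    c11 = dxdX**2 + dydX**2
    c22 = dxdY**2 + dydY**2
    c12 = dxdX * dxdY + dydX * dydY
    lam = 0.5 * (c11 + c22) + np.sqrt(0.25 * (c11 - c22)**2 + c12**2)
    return np.log(np.sqrt(lam)) / abs(T)     # FTLE field

field = ftle()
```

It is exactly this dense resampling and per-point eigenvalue work that is memory-bandwidth bound and maps well onto many-core hardware.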
Computational needs survey of NASA automation and robotics missions. Volume 1: Survey and results
NASA Technical Reports Server (NTRS)
Davis, Gloria J.
1991-01-01
NASA's operational use of advanced processor technology in space systems lags behind its commercial development by more than eight years. One of the factors contributing to this is that mission computing requirements are frequently unknown, unstated, misrepresented, or simply not available in a timely manner. NASA must provide clear common requirements to make better use of available technology, to cut development lead time on deployable architectures, and to increase the utilization of new technology. A preliminary set of advanced mission computational processing requirements of automation and robotics (A&R) systems are provided for use by NASA, industry, and academic communities. These results were obtained in an assessment of the computational needs of current projects throughout NASA. The high percent of responses indicated a general need for enhanced computational capabilities beyond the currently available 80386 and 68020 processor technology. Because of the need for faster processors and more memory, 90 percent of the polled automation projects have reduced or will reduce the scope of their implementation capabilities. The requirements are presented with respect to their targeted environment, identifying the applications required, system performance levels necessary to support them, and the degree to which they are met with typical programmatic constraints. Volume one includes the survey and results. Volume two contains the appendixes.
Computational needs survey of NASA automation and robotics missions. Volume 2: Appendixes
NASA Technical Reports Server (NTRS)
Davis, Gloria J.
1991-01-01
NASA's operational use of advanced processor technology in space systems lags behind its commercial development by more than eight years. One of the factors contributing to this is the fact that mission computing requirements are frequently unknown, unstated, misrepresented, or simply not available in a timely manner. NASA must provide clear common requirements to make better use of available technology, to cut development lead time on deployable architectures, and to increase the utilization of new technology. Here, NASA, industry and academic communities are provided with a preliminary set of advanced mission computational processing requirements of automation and robotics (A and R) systems. The results were obtained in an assessment of the computational needs of current projects throughout NASA. The high percentage of responses indicated a general need for enhanced computational capabilities beyond the currently available 80386 and 68020 processor technology. Because of the need for faster processors and more memory, 90 percent of the polled automation projects have reduced or will reduce the scope of their implemented capabilities. The requirements are presented with respect to their targeted environment, identifying the applications required, system performance levels necessary to support them, and the degree to which they are met with typical programmatic constraints. Here, appendixes are provided.
NASA Technical Reports Server (NTRS)
Reuther, James; Alonso, Juan Jose; Rimlinger, Mark J.; Jameson, Antony
1996-01-01
This work describes the application of a control theory-based aerodynamic shape optimization method to the problem of supersonic aircraft design. The design process is greatly accelerated through the use of both control theory and a parallel implementation on distributed memory computers. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods. The resulting problem is then implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) Standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on higher order computational fluid dynamics methods (CFD). In our earlier studies, the serial implementation of this design method was shown to be effective for the optimization of airfoils, wings, wing-bodies, and complex aircraft configurations using both the potential equation and the Euler equations. In our most recent paper, the Euler method was extended to treat complete aircraft configurations via a new multiblock implementation. Furthermore, during the same conference, we also presented preliminary results demonstrating that this basic methodology could be ported to distributed memory parallel computing architectures. In this paper, our concern will be to demonstrate that the combined power of these new technologies can be used routinely in an industrial design environment by applying it to the case study of the design of typical supersonic transport configurations. A particular difficulty of this test case is posed by the propulsion/airframe integration.
Ziraldo, Cordelia; Gong, Chang; Kirschner, Denise E.; ...
2016-01-06
Lack of an effective vaccine results in 9 million new cases of tuberculosis (TB) every year and 1.8 million deaths worldwide. While many infants are vaccinated at birth with BCG (an attenuated M. bovis), this does not prevent infection or development of TB after childhood. Immune responses necessary for prevention of infection or disease are still unknown, making development of effective vaccines against TB challenging. Several new vaccines are ready for human clinical trials, but these trials are difficult and expensive; especially challenging is determining the appropriate cellular response necessary for protection. The magnitude of an immune response is likely key to generating a successful vaccine. Characteristics such as numbers of central memory (CM) and effector memory (EM) T cells responsive to a diverse set of epitopes are also correlated with protection. Promising vaccines against TB contain mycobacterial subunit antigens (Ag) present during both active and latent infection. We hypothesize that protection against different key immunodominant antigens could require a vaccine that produces different levels of EM and CM for each Ag-specific memory population. We created a computational model to explore EM and CM values, and their ratio, within what we term Memory Design Space. Our model captures events involved in T cell priming within lymph nodes and tracks their circulation through blood to peripheral tissues. We used the model to test whether multiple Ag-specific memory cell populations could be generated with distinct locations within Memory Design Space at a specific time point post vaccination. Boosting can further shift memory populations to memory cell ratios unreachable by initial priming events. By strategically varying antigen load, properties of cellular interactions within the LN, and delivery parameters (e.g., number of boosts) of multi-subunit vaccines, we can generate multiple Ag-specific memory populations that cover a wide range of Memory Design Space. As a result, given a set of desired characteristics for Ag-specific memory populations, we can use our model as a tool to predict vaccine formulations that will generate those populations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ziraldo, Cordelia; Gong, Chang; Kirschner, Denise E.
Lack of an effective vaccine results in 9 million new cases of tuberculosis (TB) every year and 1.8 million deaths worldwide. While many infants are vaccinated at birth with BCG (an attenuated M. bovis), this does not prevent infection or development of TB after childhood. Immune responses necessary for prevention of infection or disease are still unknown, making development of effective vaccines against TB challenging. Several new vaccines are ready for human clinical trials, but these trials are difficult and expensive; especially challenging is determining the appropriate cellular response necessary for protection. The magnitude of an immune response is likely key to generating a successful vaccine. Characteristics such as numbers of central memory (CM) and effector memory (EM) T cells responsive to a diverse set of epitopes are also correlated with protection. Promising vaccines against TB contain mycobacterial subunit antigens (Ag) present during both active and latent infection. We hypothesize that protection against different key immunodominant antigens could require a vaccine that produces different levels of EM and CM for each Ag-specific memory population. We created a computational model to explore EM and CM values, and their ratio, within what we term Memory Design Space. Our model captures events involved in T cell priming within lymph nodes and tracks their circulation through blood to peripheral tissues. We used the model to test whether multiple Ag-specific memory cell populations could be generated with distinct locations within Memory Design Space at a specific time point post vaccination. Boosting can further shift memory populations to memory cell ratios unreachable by initial priming events. By strategically varying antigen load, properties of cellular interactions within the LN, and delivery parameters (e.g., number of boosts) of multi-subunit vaccines, we can generate multiple Ag-specific memory populations that cover a wide range of Memory Design Space. As a result, given a set of desired characteristics for Ag-specific memory populations, we can use our model as a tool to predict vaccine formulations that will generate those populations.
Shape Control of Solar Collectors Using Shape Memory Alloy Actuators
NASA Technical Reports Server (NTRS)
Lobitz, D. W.; Grossman, J. W.; Allen, J. J.; Rice, T. M.; Liang, C.; Davidson, F. M.
1996-01-01
Solar collectors that are focused on a central receiver are designed with a mechanism for defocusing the collector or disabling it by turning it out of the path of the sun's rays. This is required to avoid damaging the receiver during periods of inoperability. In either of these two cases a fail-safe operation is very desirable where during power outages the collector passively goes to its defocused or deactivated state. This paper is principally concerned with focusing and defocusing the collector in a fail-safe manner using shape memory alloy actuators. Shape memory alloys are well suited to this application in that once calibrated the actuators can be operated in an on/off mode using a minimal amount of electric power. Also, in contrast to other smart materials that were investigated for this application, shape memory alloys are capable of providing enough stroke at the appropriate force levels to focus the collector. Design and analysis details presented, along with comparisons to test data taken from an actual prototype, demonstrate that the collector can be repeatedly focused and defocused within accuracies required by typical solar energy systems. In this paper the design, analysis and testing of a solar collector which is deformed into its desired shape by shape memory alloy actuators is presented. Computations indicate collector shapes much closer to spherical and with smaller focal lengths can be achieved by moving the actuators inward to a radius of approximately 6 inches. This would require actuators with considerably more stroke and some alternate SMA actuators are currently under consideration. Whatever SMA actuator is finally chosen for this application, repeatability and fatigue tests will be required to investigate the long term performance of the actuator.
NASA Astrophysics Data System (ADS)
Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh
2015-07-01
This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for the current class of multicore architectures. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism; however, when it comes to 3D data migration, the resource requirements of the algorithm grow as the data size increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute intensive part of the Kirchhoff depth migration algorithm is the calculation of traveltime tables, due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post- and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series supercomputer. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.
Human Memory Organization for Computer Programs.
ERIC Educational Resources Information Center
Norcio, A. F.; Kerst, Stephen M.
1983-01-01
Results of study investigating human memory organization in processing of computer programming languages indicate that algorithmic logic segments form a cognitive organizational structure in memory for programs. Statement indentation and internal program documentation did not enhance organizational process of recall of statements in five Fortran…
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on Apache Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based Apache Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesoscale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URL's (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
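A minimal PySpark sketch of the in-memory map-reduce pattern described above (partition per-time-step arrays across the cluster, then reduce simple statistics without intermediate disk I/O); it uses plain RDDs with random stand-in grids, not SciSpark's sRDD API or its parallel file ingest.

```python
# Partition per-time-step grids across executors and reduce a simple metric.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="climate-metrics-sketch")

# Pretend each element is (time_index, 2-D model/observation grid).
times = list(range(100))
rdd = sc.parallelize(times, numSlices=10) \
        .map(lambda t: (t, np.random.rand(180, 360)))

# Stage 1: per-time-step spatial mean (stays in executor memory).
means = rdd.mapValues(lambda grid: float(grid.mean()))

# Stage 2: reduce to a single time-averaged value.
count = means.count()
total = means.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
print("time-mean of spatial means:", total / count)

sc.stop()
```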
Computer Use and Its Effect on the Memory Process in Young and Adults
ERIC Educational Resources Information Center
Alliprandini, Paula Mariza Zedu; Straub, Sandra Luzia Wrobel; Brugnera, Elisangela; de Oliveira, Tânia Pitombo; Souza, Isabela Augusta Andrade
2013-01-01
This work investigates the effect of computer use in the memory process in young and adults under the Perceptual and Memory experimental conditions. The memory condition involved the phases acquisition of information and recovery, on time intervals (2 min, 24 hours and 1 week) on situations of pre and post-test (before and after the participants…
Computational Model-Based Prediction of Human Episodic Memory Performance Based on Eye Movements
NASA Astrophysics Data System (ADS)
Sato, Naoyuki; Yamaguchi, Yoko
Subjects' episodic memory performance is not simply reflected by eye movements. We use a 'theta phase coding' model of the hippocampus to predict subjects' memory performance from their eye movements. Results demonstrate the ability of the model to predict subjects' memory performance. These studies provide a novel approach to computational modeling in the human-machine interface.
Design knowledge capture for a corporate memory facility
NASA Technical Reports Server (NTRS)
Boose, John H.; Shema, David B.; Bradshaw, Jeffrey M.
1990-01-01
Currently, much of the information regarding decision alternatives and trade-offs made in the course of a major program development effort is not represented or retained in a way that permits computer-based reasoning over the life cycle of the program. The loss of this information results in problems in tracing design alternatives to requirements, in assessing the impact of changes in requirements, and in configuration management. To address these problems, the problem of building an intelligent, active corporate memory facility was studied, one which would provide for the capture of the requirements and standards of a program, analyze the design alternatives and trade-offs made over the program's lifetime, and examine relationships between requirements and design trade-offs. Early phases of the work have concentrated on design knowledge capture for the Space Station Freedom. Tools that help automate and document engineering trade studies are demonstrated and extended, and another tool is being developed to help designers interactively explore design alternatives and constraints.
Modality dependence and intermodal transfer in the Corsi Spatial Sequence Task: Screen vs. Floor.
Röser, Andrea; Hardiess, Gregor; Mallot, Hanspeter A
2016-07-01
Four versions of the Corsi Spatial Sequence Task (CSST) were tested in a complete within-subject design, investigating whether participants' performance depends on the modality of task presentation and reproduction that put different demands on spatial processing. Presentation of the sequence (encoding phase) and the reproduction (recall phase) were each carried out either on a computer screen or on the floor of a room, involving actual walking in the recall phase. Combinations of the two different encoding and recall procedures result in the modality conditions Screen-Screen, Screen-Floor, Floor-Screen, and Floor-Floor. Results show the expected decrease in performance with increasing sequence length, which is likely due to processing limitations of working memory. We also found differences in performance between the modality conditions indicating different involvements of spatial working memory processes. Participants performed best in the Screen-Screen modality condition. Floor-Screen and Floor-Floor modality conditions require additional working memory resources for reference frame transformation and spatial updating, respectively; the resulting impairment of the performance was about the same in these two conditions. Finally, the Screen-Floor modality condition requires both types of additional spatial demands and led to the poorest performance. Therefore, we suggest that besides the well-known spatial requirements of CSST, additional working memory resources are demanded in walking CSST supporting processes such as spatial updating, mental rotation, reference frame transformation, and the control of walking itself.
NASA Astrophysics Data System (ADS)
Tomaro, Robert F.
1998-07-01
The present research is aimed at developing a higher-order, spatially accurate scheme for both steady and unsteady flow simulations using unstructured meshes. The resulting scheme must work on a variety of general problems to ensure the creation of a flexible, reliable and accurate aerodynamic analysis tool. To calculate the flow around complex configurations, unstructured grids and the associated flow solvers have been developed. Efficient simulations require the minimum use of computer memory and computational times. Unstructured flow solvers typically require more computer memory than a structured flow solver due to the indirect addressing of the cells. The approach taken in the present research was to modify an existing three-dimensional unstructured flow solver to first decrease the computational time required for a solution and then to increase the spatial accuracy. The terms required to simulate flow involving non-stationary grids were also implemented. First, an implicit solution algorithm was implemented to replace the existing explicit procedure. Several test cases, including internal and external, inviscid and viscous, two-dimensional, three-dimensional and axi-symmetric problems, were simulated for comparison between the explicit and implicit solution procedures. The increased efficiency and robustness of modified code due to the implicit algorithm was demonstrated. Two unsteady test cases, a plunging airfoil and a wing undergoing bending and torsion, were simulated using the implicit algorithm modified to include the terms required for a moving and/or deforming grid. Secondly, a higher than second-order spatially accurate scheme was developed and implemented into the baseline code. Third- and fourth-order spatially accurate schemes were implemented and tested. The original dissipation was modified to include higher-order terms and modified near shock waves to limit pre- and post-shock oscillations. The unsteady cases were repeated using the higher-order spatially accurate code. The new solutions were compared with those obtained using the second-order spatially accurate scheme. Finally, the increased efficiency of using an implicit solution algorithm in a production Computational Fluid Dynamics flow solver was demonstrated for steady and unsteady flows. A third- and fourth-order spatially accurate scheme has been implemented creating a basis for a state-of-the-art aerodynamic analysis tool.
Numericware i: Identical by State Matrix Calculator
Kim, Bongsong; Beavis, William D
2017-01-01
We introduce software, Numericware i, to compute identical by state (IBS) matrix based on genotypic data. Calculating an IBS matrix with a large dataset requires large computer memory and takes lengthy processing time. Numericware i addresses these challenges with 2 algorithmic methods: multithreading and forward chopping. The multithreading allows computational routines to concurrently run on multiple central processing unit (CPU) processors. The forward chopping addresses memory limitation by dividing a dataset into appropriately sized subsets. Numericware i allows calculation of the IBS matrix for a large genotypic dataset using a laptop or a desktop computer. For comparison with different software, we calculated genetic relationship matrices using Numericware i, SPAGeDi, and TASSEL with the same genotypic dataset. Numericware i calculates IBS coefficients between 0 and 2, whereas SPAGeDi and TASSEL produce different ranges of values including negative values. The Pearson correlation coefficient between the matrices from Numericware i and TASSEL was high at .9972, whereas SPAGeDi showed low correlation with Numericware i (.0505) and TASSEL (.0587). With a high-dimensional dataset of 500 entities by 10 000 000 SNPs, Numericware i spent 382 minutes using 19 CPU threads and 64 GB memory by dividing the dataset into 3 pieces, whereas SPAGeDi and TASSEL failed with the same dataset. Numericware i is freely available for Windows and Linux under CC-BY 4.0 license at https://figshare.com/s/f100f33a8857131eb2db. PMID:28469375
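The sketch below illustrates the kind of blocked ("forward chopping"-style) IBS computation described above, accumulating the coefficient one SNP block at a time so the full marker matrix never has to reside in memory at once; it is a generic NumPy illustration with 0/1/2 genotype coding, not Numericware i's implementation, and the block size and dataset are arbitrary.

```python
# Chunked IBS matrix: IBS_ij = mean over loci of (2 - |g_i - g_j|), in [0, 2].
import numpy as np

def ibs_matrix(genotypes, block=1_000):
    n, m = genotypes.shape                     # individuals x SNPs
    acc = np.zeros((n, n))
    for start in range(0, m, block):
        g = genotypes[:, start:start + block].astype(np.int16)
        # Pairwise sum of (2 - |g_i - g_j|) over this block of SNPs only.
        diff = np.abs(g[:, None, :] - g[None, :, :])
        acc += (2 - diff).sum(axis=2)
    return acc / m                             # diagonal entries equal 2

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(50, 10_000))      # 50 individuals, 10k SNPs
K = ibs_matrix(G, block=2_000)
```

Each block could equally be handed to a separate thread or process, which is the role multithreading plays in the tool described above.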
Acceleration of discrete stochastic biochemical simulation using GPGPU.
Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira
2015-01-01
For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130.
Acceleration of discrete stochastic biochemical simulation using GPGPU
Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira
2015-01-01
For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130. PMID:25762936
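For reference, a plain CPU, single-realization version of the stochastic simulation algorithm (Gillespie's direct method) is sketched below for a toy birth-death model; the GPU implementation described above runs many such independent realizations concurrently and parallelizes the work within each one. The model and rate constants are arbitrary.

```python
# Gillespie's direct method for a two-reaction birth-death process.
import numpy as np

def ssa_birth_death(x0=10, k_birth=1.0, k_death=0.1, t_end=50.0, seed=0):
    rng = np.random.default_rng(seed)
    t, x = 0.0, x0
    times, states = [t], [x]
    while t < t_end:
        a1, a2 = k_birth, k_death * x          # reaction propensities
        a0 = a1 + a2
        if a0 == 0.0:
            break
        t += rng.exponential(1.0 / a0)         # waiting time to next reaction
        if rng.random() * a0 < a1:             # choose which reaction fires
            x += 1
        else:
            x -= 1
        times.append(t); states.append(x)      # record the time course
    return np.array(times), np.array(states)

# Many independent realizations (what the GPU kernels execute concurrently).
runs = [ssa_birth_death(seed=s) for s in range(100)]
```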
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sreepathi, Sarat; Kumar, Jitendra; Mills, Richard T.
A proliferation of data from vast networks of remote sensing platforms (satellites, unmanned aircraft systems (UAS), airborne etc.), observational facilities (meteorological, eddy covariance etc.), state-of-the-art sensors, and simulation models offers unprecedented opportunities for scientific discovery. Unsupervised classification is a widely applied data mining approach to derive insights from such data. However, classification of very large data sets is a complex computational problem that requires efficient numerical algorithms and implementations on high performance computing (HPC) platforms. Additionally, increasing power, space, cooling and efficiency requirements have led to the deployment of hybrid supercomputing platforms with complex architectures and memory hierarchies like the Titan system at Oak Ridge National Laboratory. The advent of such accelerated computing architectures offers new challenges and opportunities for big data analytics in general and specifically, large scale cluster analysis in our case. Although there is an existing body of work on parallel cluster analysis, those approaches do not fully meet the needs imposed by the nature and size of our large data sets. Moreover, they had scaling limitations and were mostly limited to traditional distributed memory computing platforms. We present a parallel Multivariate Spatio-Temporal Clustering (MSTC) technique based on k-means cluster analysis that can target hybrid supercomputers like Titan. We developed a hybrid MPI, CUDA and OpenACC implementation that can utilize both CPU and GPU resources on computational nodes. We describe performance results on Titan that demonstrate the scalability and efficacy of our approach in processing large ecological data sets.
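The serial kernel underlying this approach is ordinary k-means; a compact NumPy version is sketched below for orientation. The paper's contribution, the hybrid MPI/CUDA/OpenACC parallelization of the assignment and update steps, is not represented here, and the data and cluster count are placeholders.

```python
# Plain k-means: alternate nearest-centroid assignment and centroid update.
import numpy as np

def kmeans(X, k=8, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid for every observation.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute centroids from their members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers

# e.g. multivariate spatio-temporal records: (samples, variables)
X = np.random.default_rng(2).standard_normal((10_000, 6))
labels, centers = kmeans(X, k=8)
```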
Efficacy of Code Optimization on Cache-based Processors
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
The current common wisdom in the U.S. is that the powerful, cost-effective supercomputers of tomorrow will be based on commodity (RISC) micro-processors with cache memories. Already, most distributed systems in the world use such hardware as building blocks. This shift away from vector supercomputers and towards cache-based systems has brought about a change in programming paradigm, even when ignoring issues of parallelism. Vector machines require inner-loop independence and regular, non-pathological memory strides (usually this means: non-power-of-two strides) to allow efficient vectorization of array operations. Cache-based systems require spatial and temporal locality of data, so that data once read from main memory and stored in high-speed cache memory is used optimally before being written back to main memory. This means that the most cache-friendly array operations are those that feature zero or unit stride, so that each unit of data read from main memory (a cache line) contains information for the next iteration in the loop. Moreover, loops ought to be 'fat', meaning that as many operations as possible are performed on cache data, provided instruction caches do not overflow and enough registers are available. If unit stride is not possible, for example because of some data dependency, then care must be taken to avoid pathological strides, just as on vector computers. For cache-based systems the issues are more complex, due to the effects of associativity and of non-unit block (cache line) size. But there is more to the story. Most modern micro-processors are superscalar, which means that they can issue several (arithmetic) instructions per clock cycle, provided that there are enough independent instructions in the loop body. This is another argument for providing fat loop bodies. With these restrictions, it appears fairly straightforward to produce code that will run efficiently on any cache-based system. It can be argued that although some of the important computational algorithms employed at NASA Ames require different programming styles on vector machines and cache-based machines, respectively, neither architecture class appeared to be favored by particular algorithms in principle. Practice tells us that the situation is more complicated. This report presents observations and some analysis of performance tuning for cache-based systems. We point out several counterintuitive results that serve as a cautionary reminder that memory accesses are not the only factors that determine performance, and that within the class of cache-based systems, significant differences exist.
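The unit-stride point can be felt even from a high-level language: for a C-ordered (row-major) array, traversing the contiguous axis is markedly faster than traversing with a large stride, for the same arithmetic. The array size below is an arbitrary choice and the exact timings are machine dependent; this is only a small demonstration of spatial locality, not the report's benchmark suite.

```python
# Contiguous (unit-stride) vs. strided traversal of a row-major array.
import time
import numpy as np

a = np.random.rand(4_000, 4_000)           # C order: rows are contiguous

def time_it(fn):
    t0 = time.perf_counter(); fn(); return time.perf_counter() - t0

row_major = time_it(lambda: sum(float(a[i, :].sum()) for i in range(a.shape[0])))
col_major = time_it(lambda: sum(float(a[:, j].sum()) for j in range(a.shape[1])))
print(f"unit stride: {row_major:.3f}s   strided: {col_major:.3f}s")
```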
DOE Office of Scientific and Technical Information (OSTI.GOV)
Haugen, Carl C.; Forget, Benoit; Smith, Kord S.
Most high performance computing systems being deployed currently and envisioned for the future are based on making use of heavy parallelism across many computational nodes and many concurrent cores. These types of heavily parallel systems often have relatively little memory per core but large amounts of computing capability. This places a significant constraint on how data storage is handled in many Monte Carlo codes. This is made even more significant in fully coupled multiphysics simulations, which require that simulations of many physical phenomena be carried out concurrently on individual processing nodes, which further reduces the amount of memory available for storage of Monte Carlo data. As such, there has been a move towards on-the-fly nuclear data generation to reduce memory requirements associated with interpolation between pre-generated large nuclear data tables for a selection of system temperatures. Methods have been previously developed and implemented in MIT’s OpenMC Monte Carlo code for both the resolved resonance regime and the unresolved resonance regime, but are currently absent for the thermal energy regime. While there are many components involved in generating a thermal neutron scattering cross section on-the-fly, this work will focus on a proposed method for determining the energy and direction of a neutron after a thermal incoherent inelastic scattering event. This work proposes a rejection sampling based method using the thermal scattering kernel to determine the correct outgoing energy and angle. The goal of this project is to be able to treat the full S(α,β) kernel for graphite, to assist in high fidelity simulations of the TREAT reactor at Idaho National Laboratory. The method is, however, sufficiently general to be applicable to other thermal scattering materials, and can be initially validated with the continuous analytic free gas model.
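A generic rejection-sampling sketch of the kind of outgoing-energy selection discussed above is given below; the stand-in "kernel" is a simple made-up density, not the graphite S(α,β) law, and it serves only to show the accept/reject mechanics and why no tabulated sampling distribution has to be stored.

```python
# Rejection sampling of an outgoing energy from an un-normalized density.
import numpy as np

rng = np.random.default_rng(0)

def kernel(e_out, e_in):
    # Placeholder un-normalized density peaked near the incident energy.
    return np.exp(-((e_out - e_in) ** 2) / (0.1 * e_in) ** 2)

def sample_outgoing_energy(e_in, e_max, bound):
    # `bound` must be >= the maximum of kernel(E', e_in) on [0, e_max].
    while True:
        e_prop = rng.uniform(0.0, e_max)        # proposal from a uniform
        if rng.uniform(0.0, bound) <= kernel(e_prop, e_in):
            return e_prop                       # accept with prob kernel/bound

samples = [sample_outgoing_energy(e_in=0.025, e_max=0.1, bound=1.0)
           for _ in range(1000)]
```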
Fast maximum likelihood estimation using continuous-time neural point process models.
Lepage, Kyle Q; MacDonald, Christopher J
2015-06-01
A recent report estimates that the number of simultaneously recorded neurons is growing exponentially. A commonly employed statistical paradigm using discrete-time point process models of neural activity involves the computation of a maximum-likelihood estimate. The time to compute this estimate, per neuron, is proportional to the number of bins in a finely spaced discretization of time. By using continuous-time models of neural activity and the optimally efficient Gaussian quadrature, memory requirements and computation times are dramatically decreased in the commonly encountered situation where the number of parameters p is much less than the number of time-bins n. In this regime, with q equal to the quadrature order, memory requirements are decreased from O(np) to O(qp), and the number of floating-point operations is decreased from O(np²) to O(qp²). Accuracy of the proposed estimates is assessed based upon physiological consideration, error bounds, and mathematical results describing the relation between numerical integration error and numerical error affecting both parameter estimates and the observed Fisher information. A check is provided which is used to adapt the order of numerical integration. The procedure is verified in simulation and for hippocampal recordings. It is found that in 95% of hippocampal recordings a q of 60 yields numerical error negligible with respect to parameter estimate standard error. Statistical inference using the proposed methodology is a fast and convenient alternative to statistical inference performed using a discrete-time point process model of neural activity. It enables the employment of the statistical methodology available with discrete-time inference, but is faster, uses less memory, and avoids any error due to discretization.
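The core idea, replacing a fine time discretization with a low-order quadrature rule in the point-process log-likelihood, can be sketched as follows; the log-linear intensity model and the simulated spike times are illustrative, not the paper's hippocampal models.

```python
# Poisson-process log-likelihood with a Gauss-Legendre quadrature of order q:
#   ll = sum_i log lam(t_i) - int_0^T lam(t) dt
import numpy as np

def log_likelihood(theta, spikes, T, q=60):
    lam = lambda t: np.exp(theta[0] + theta[1] * t)   # intensity model
    # Gauss-Legendre nodes/weights mapped from [-1, 1] to [0, T].
    x, w = np.polynomial.legendre.leggauss(q)
    t_nodes = 0.5 * T * (x + 1.0)
    integral = 0.5 * T * np.sum(w * lam(t_nodes))     # O(q) work, not O(n_bins)
    return np.sum(np.log(lam(np.asarray(spikes)))) - integral

rng = np.random.default_rng(3)
spikes = np.sort(rng.uniform(0.0, 10.0, size=40))
ll = log_likelihood(theta=(1.0, -0.05), spikes=spikes, T=10.0)
```

Maximizing this function over theta (e.g. with any standard optimizer) replaces the per-bin evaluation of a discrete-time likelihood.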
2010-07-22
dependent, providing a natural bandwidth match between compute cores and the memory subsystem. • High Bandwidth Density. Waveguides crossing the chip...simulate this memory access architecture on a 256-core chip with a concentrated 64-node network using detailed traces of high-performance embedded...memory modules, we place memory access points (MAPs) around the periphery of the chip connected to the network. These MAPs, shown in Figure 4, contain
2015-06-01
examine how a computer forensic investigator/incident handler, without specialised computer memory or software reverse engineering skills, can successfully...memory images and malware, this new series of reports will be directed at those who must analyse Linux malware-infected memory images. The skills ...
NASA Astrophysics Data System (ADS)
Bouchet, L.; Amestoy, P.; Buttari, A.; Rouet, F.-H.; Chauvin, M.
2013-02-01
Nowadays, analyzing and reducing the ever larger astronomical datasets is becoming a crucial challenge, especially for long cumulated observation times. The INTEGRAL/SPI X/γ-ray spectrometer is an instrument for which it is essential to process many exposures at the same time in order to increase the low signal-to-noise ratio of the weakest sources. In this context, the conventional methods for data reduction are inefficient and sometimes not feasible at all. Processing several years of data simultaneously requires computing not only the solution of a large system of equations, but also the associated uncertainties. We aim at reducing the computation time and the memory usage. Since the SPI transfer function is sparse, we have used some popular methods for the solution of large sparse linear systems; we briefly review these methods. We use the Multifrontal Massively Parallel Solver (MUMPS) to compute the solution of the system of equations. We also need to compute the variance of the solution, which amounts to computing selected entries of the inverse of the sparse matrix corresponding to our linear system. This can be achieved through one of the latest features of the MUMPS software that has been partly motivated by this work. In this paper we provide a brief presentation of this feature and evaluate its effectiveness on astrophysical problems requiring the processing of large datasets simultaneously, such as the study of the entire emission of the Galaxy. We used these algorithms to solve the large sparse systems arising from SPI data processing and to obtain both their solutions and the associated variances. In conclusion, thanks to these newly developed tools, processing large datasets arising from SPI is now feasible with both a reasonable execution time and a low memory usage.
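The two computations discussed above, solving the sparse system and extracting entries of its inverse for the variances, can be mimicked at small scale with SciPy as in the hedged sketch below; here the diagonal of the inverse is obtained by the brute-force route of solving against unit vectors, which is exactly what the selected-inversion feature of MUMPS avoids on large problems. The matrix and right-hand side are random stand-ins, not the SPI transfer function.

```python
# Sparse factorization, solve, and diagonal of the inverse (solution variances
# when A is the normal-equations matrix).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 500
A = sp.random(n, n, density=1e-2, random_state=0) + sp.eye(n) * n  # well conditioned
b = np.random.default_rng(0).standard_normal(n)

lu = splu(A.tocsc())           # one sparse factorization, reused below
x = lu.solve(b)                # the solution of A x = b

diag_inv = np.empty(n)         # entries (A^{-1})_{ii}
e = np.zeros(n)
for i in range(n):
    e[i] = 1.0
    diag_inv[i] = lu.solve(e)[i]
    e[i] = 0.0
```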
Wiener, J M; Ehbauer, N N; Mallot, H A
2009-09-01
For large numbers of targets, path planning is a complex and computationally expensive task. Humans, however, usually solve such tasks quickly and efficiently. We present experiments studying human path planning performance and the cognitive processes and heuristics involved. Twenty-five places were arranged on a regular grid in a large room. Participants were repeatedly asked to solve traveling salesman problems (TSP), i.e., to find the shortest closed loop connecting a start location with multiple target locations. In Experiment 1, we tested whether humans employed the nearest neighbor (NN) strategy when solving the TSP. Results showed that subjects outperform the NN-strategy, suggesting that it is not sufficient to explain human route planning behavior. As a second possible strategy we tested a hierarchical planning heuristic in Experiment 2, demonstrating that participants first plan a coarse route on the region level that is refined during navigation. To test for the relevance of spatial working memory (SWM) and spatial long-term memory (LTM) for planning performance and the planning heuristics applied, we varied the memory demands between conditions in Experiment 2. In one condition the target locations were directly marked, such that no memory was required; a second condition required participants to memorize the target locations during path planning (SWM); in a third condition, additionally, the locations of targets had to be retrieved from LTM (SWM and LTM). Results showed that navigation performance decreased with increasing memory demands while the dependence on the hierarchical planning heuristic increased.
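For reference, the nearest-neighbor heuristic tested in Experiment 1 can be written in a few lines: from the start location, always walk to the closest unvisited target, then close the loop back to the start. The coordinates below are an arbitrary 5 x 5 grid standing in for the 25 places; this is only the benchmark strategy, not a model of the participants' behavior.

```python
# Nearest-neighbor tour for a closed-loop TSP instance.
import numpy as np

def nearest_neighbor_tour(points, start=0):
    points = np.asarray(points, dtype=float)
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda j: np.linalg.norm(points[j] - last))
        tour.append(nxt)
        unvisited.remove(nxt)
    tour.append(start)                       # close the loop
    length = sum(np.linalg.norm(points[tour[i + 1]] - points[tour[i]])
                 for i in range(len(tour) - 1))
    return tour, length

grid = [(x, y) for x in range(5) for y in range(5)]   # 25 places on a grid
tour, length = nearest_neighbor_tour(grid)
```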
Application-Controlled Demand Paging for Out-of-Core Visualization
NASA Technical Reports Server (NTRS)
Cox, Michael; Ellsworth, David; Kutler, Paul (Technical Monitor)
1997-01-01
In the area of scientific visualization, input data sets are often very large. In visualization of Computational Fluid Dynamics (CFD) in particular, input data sets today can surpass 100 Gbytes, and are expected to scale with the ability of supercomputers to generate them. Some visualization tools already partition large data sets into segments, and load appropriate segments as they are needed. However, this does not remove the problem for two reasons: 1) there are data sets for which even the individual segments are too large for the largest graphics workstations, 2) many practitioners do not have access to workstations with the memory capacity required to load even a segment, especially since the state-of-the-art visualization tools tend to be developed by researchers with much more powerful machines. When the size of the data that must be accessed is larger than the size of memory, some form of virtual memory is simply required. This may be by segmentation, paging, or by paged segments. In this paper we demonstrate that complete reliance on operating system virtual memory for out-of-core visualization leads to poor performance. We then describe a paged segment system that we have implemented, and explore the principles of memory management that can be employed by the application for out-of-core visualization. We show that application control over some of these can significantly improve performance. We show that sparse traversal can be exploited by loading only those data actually required. We show also that application control over data loading can be exploited by 1) loading data from alternative storage format (in particular 3-dimensional data stored in sub-cubes), 2) controlling the page size. Both of these techniques effectively reduce the total memory required by visualization at run-time. We also describe experiments we have done on remote out-of-core visualization (when pages are read by demand from remote disk) whose results are promising.
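A rough sketch of application-controlled, out-of-core access in this spirit: keep a large 3-D field on disk and map in only the sub-cube a visualization request actually touches. It uses numpy.memmap (an OS-backed mapping) as a stand-in for the paper's paged-segment system; the file name, field size and block shape are placeholders.

```python
# Load only the requested sub-cube of a large on-disk 3-D field.
import numpy as np

shape = (256, 256, 256)                       # full field, float32 (64 MB)
fname = "field_f32.dat"

# One-time creation of a dummy on-disk data set.
np.memmap(fname, dtype=np.float32, mode="w+", shape=shape).flush()

def load_subcube(origin, size):
    """Read only the requested sub-cube; untouched blocks stay on disk."""
    f = np.memmap(fname, dtype=np.float32, mode="r", shape=shape)
    i, j, k = origin
    di, dj, dk = size
    return np.array(f[i:i + di, j:j + dj, k:k + dk])   # copy into main memory

block = load_subcube(origin=(64, 64, 64), size=(32, 32, 32))
```

Storing the field as sub-cubes on disk (rather than flat slices) would further reduce the number of pages touched per request, which is the point the paper makes about alternative storage formats and page sizes.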
Nonvolatile “AND,” “OR,” and “NOT” Boolean logic gates based on phase-change memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Y.; Zhong, Y. P.; Deng, Y. F.
2013-12-21
Electronic devices or circuits that can implement both logic and memory functions are regarded as the building blocks for future massively parallel computing beyond the von Neumann architecture. Here we proposed phase-change memory (PCM)-based nonvolatile logic gates capable of AND, OR, and NOT Boolean logic operations, verified in SPICE simulations and circuit experiments. The logic operations are performed in parallel, and the results can be stored directly in the states of the logic gates, facilitating the combination of computing and memory in the same circuit. These results are encouraging for ultralow-power and high-speed nonvolatile logic circuit design based on novel memory devices.
OS friendly microprocessor architecture: Hardware level computer security
NASA Astrophysics Data System (ADS)
Jungwirth, Patrick; La Fratta, Patrick
2016-05-01
We present an introduction to the patented OS Friendly Microprocessor Architecture (OSFA) and hardware level computer security. Conventional microprocessors have not tried to balance hardware performance and OS performance at the same time. Conventional microprocessors have depended on the Operating System for computer security and information assurance. The goal of the OS Friendly Architecture is to provide a high performance and secure microprocessor and OS system. We are interested in cyber security, information technology (IT), and SCADA control professionals reviewing the hardware level security features. The OS Friendly Architecture is a switched set of cache memory banks in a pipeline configuration. For light-weight threads, the memory pipeline configuration provides near instantaneous context switching times. The pipelining and parallelism provided by the cache memory pipeline provides for background cache read and write operations while the microprocessor's execution pipeline is running instructions. The cache bank selection controllers provide arbitration to prevent the memory pipeline and microprocessor's execution pipeline from accessing the same cache bank at the same time. This separation allows the cache memory pages to transfer to and from level 1 (L1) caching while the microprocessor pipeline is executing instructions. Computer security operations are implemented in hardware. By extending Unix file permissions bits to each cache memory bank and memory address, the OSFA provides hardware level computer security.
Applying Parallel Adaptive Methods with GeoFEST/PYRAMID to Simulate Earth Surface Crustal Dynamics
NASA Technical Reports Server (NTRS)
Norton, Charles D.; Lyzenga, Greg; Parker, Jay; Glasscoe, Margaret; Donnellan, Andrea; Li, Peggy
2006-01-01
This viewgraph presentation reviews the use Adaptive Mesh Refinement (AMR) in simulating the Crustal Dynamics of Earth's Surface. AMR simultaneously improves solution quality, time to solution, and computer memory requirements when compared to generating/running on a globally fine mesh. The use of AMR in simulating the dynamics of the Earth's Surface is spurred by future proposed NASA missions, such as InSAR for Earth surface deformation and other measurements. These missions will require support for large-scale adaptive numerical methods using AMR to model observations. AMR was chosen because it has been successful in computation fluid dynamics for predictive simulation of complex flows around complex structures.
Extreme-scale Algorithms and Solver Resilience
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dongarra, Jack
A widening gap exists between the peak performance of high-performance computers and the performance achieved by complex applications running on these platforms. Over the next decade, extreme-scale systems will present major new challenges to algorithm development that could amplify this mismatch in such a way that it prevents the productive use of future DOE Leadership computers, due to the following: extreme levels of parallelism due to multicore processors; an increase in system fault rates requiring algorithms to be resilient beyond just checkpoint/restart; complex memory hierarchies and costly data movement in both energy and performance; heterogeneous system architectures (mixing CPUs, GPUs, etc.); and conflicting goals of performance, resilience, and power requirements.
Klapan, Ivica; Vranjes, Zeljko; Prgomet, Drago; Lukinović, Juraj
2008-03-01
The real-time requirement means that the simulation should be able to follow the actions of the user that may be moving in the virtual environment. The computer system should also store in its memory a three-dimensional (3D) model of the virtual environment. In that case a real-time virtual reality system will update the 3D graphic visualization as the user moves, so that up-to-date visualization is always shown on the computer screen. Upon completion of the tele-operation, the surgeon compares the preoperative and postoperative images and models of the operative field, and studies video records of the procedure itself. Using intraoperative records, animated images of the real tele-procedure performed can be designed. Virtual surgery offers the possibility of preoperative planning in rhinology. The intraoperative use of the computer in real time requires development of appropriate hardware and software to connect the medical instrumentarium with the computer and to operate the computer through the connected instrumentarium and sophisticated multimedia interfaces.
Molecular dynamics simulations through GPU video games technologies
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
2016-01-01
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the most de novo techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. This has become a strategic technique and now plays a key role in many areas of the exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations could require enormous amounts of computer memory and time and therefore their execution has been a big problem. Despite the huge computational cost, molecular dynamics have been implemented using traditional computers with a central processing unit (CPU). Graphics processing unit (GPU) computing technology was first designed with the goal of improving video games by rapidly creating and displaying images in a frame buffer for output to a screen. The hybrid GPU-CPU implementation, combined with parallel computing, is a novel technology to perform a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations including MD simulations. Herein, we describe the new methodologies developed initially for video games and how they are now applied in MD simulations. PMID:27525251
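As a point of reference for what GPU MD codes accelerate, a minimal CPU molecular dynamics loop (a few Lennard-Jones particles integrated with velocity Verlet) is sketched below; the particle count, parameters and the absence of a neighbor list or periodic boundaries are all simplifications, not a production MD setup.

```python
# Tiny Lennard-Jones MD with velocity Verlet integration.
import numpy as np

grid = np.arange(4) * 1.5                               # lattice spacing 1.5*sigma
pos = np.array([(x, y, z) for x in grid for y in grid for z in grid], dtype=float)
vel = np.zeros_like(pos)
dt, steps = 1e-3, 1000

def forces(pos, eps=1.0, sigma=1.0):
    r = pos[:, None, :] - pos[None, :, :]               # pairwise displacements
    d2 = (r ** 2).sum(-1) + np.eye(len(pos))            # avoid divide-by-zero on i==j
    inv6 = (sigma ** 2 / d2) ** 3
    mag = 24.0 * eps * (2.0 * inv6 ** 2 - inv6) / d2    # |F| / r for the LJ potential
    mag[np.eye(len(pos), dtype=bool)] = 0.0
    return (mag[:, :, None] * r).sum(axis=1)

f = forces(pos)
for _ in range(steps):                                   # velocity Verlet
    vel += 0.5 * dt * f
    pos += dt * vel
    f = forces(pos)
    vel += 0.5 * dt * f
```

The all-pairs force evaluation is the O(N²) kernel that GPU codes parallelize across thousands of threads.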
NASA Astrophysics Data System (ADS)
Thompson, Kyle Bonner
An algorithm is described to efficiently compute aerothermodynamic design sensitivities using a decoupled variable set. In a conventional approach to computing design sensitivities for reacting flows, the species continuity equations are fully coupled to the conservation laws for momentum and energy. In this algorithm, the species continuity equations are solved separately from the mixture continuity, momentum, and total energy equations. This decoupling simplifies the implicit system, so that the flow solver can be made significantly more efficient, with very little penalty on overall scheme robustness. Most importantly, the computational cost of the point implicit relaxation is shown to scale linearly with the number of species for the decoupled system, whereas the fully coupled approach scales quadratically. Also, the decoupled method significantly reduces the cost in wall time and memory in comparison to the fully coupled approach. This decoupled approach for computing design sensitivities with the adjoint system is demonstrated for inviscid flow in chemical non-equilibrium around a re-entry vehicle with a retro-firing annular nozzle. The sensitivities of the surface temperature and mass flow rate through the nozzle plenum are computed with respect to plenum conditions and verified against sensitivities computed using a complex-variable finite-difference approach. The decoupled scheme significantly reduces the computational time and memory required to complete the optimization, making this an attractive method for high-fidelity design of hypersonic vehicles.
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.
2011-01-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W
2012-06-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, register usage is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance is optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
NASA Technical Reports Server (NTRS)
Koppenhoefer, Kyle C.; Gullerud, Arne S.; Ruggieri, Claudio; Dodds, Robert H., Jr.; Healy, Brian E.
1998-01-01
This report describes theoretical background material and commands necessary to use the WARP3D finite element code. WARP3D is under continuing development as a research code for the solution of very large-scale, 3-D solid models subjected to static and dynamic loads. Specific features in the code oriented toward the investigation of ductile fracture in metals include a robust finite strain formulation, a general J-integral computation facility (with inertia, face loading), an element extinction facility to model crack growth, nonlinear material models including viscoplastic effects, and the Gurson-Tvergaard dilatant plasticity model for void growth. The nonlinear, dynamic equilibrium equations are solved using an incremental-iterative, implicit formulation with full Newton iterations to eliminate residual nodal forces. The history integration of the nonlinear equations of motion is accomplished with Newmark's Beta method. A central feature of WARP3D involves the use of a linear-preconditioned conjugate gradient (LPCG) solver implemented in an element-by-element format to replace a conventional direct linear equation solver. This software architecture dramatically reduces both the memory requirements and CPU time for very large, nonlinear solid models since formation of the assembled (dynamic) stiffness matrix is avoided. Analyses thus exhibit the numerical stability for large time (load) steps provided by the implicit formulation coupled with the low memory requirements characteristic of an explicit code. In addition to the much lower memory requirements of the LPCG solver, the CPU time required for solution of the linear equations during each Newton iteration is generally one-half or less of the CPU time required for a traditional direct solver. All other computational aspects of the code (element stiffnesses, element strains, stress updating, element internal forces) are implemented in the element-by-element, blocked architecture. This greatly improves vectorization of the code on uni-processor hardware and enables straightforward parallel-vector processing of element blocks on multi-processor hardware.
ERIC Educational Resources Information Center
Linster, Christiane; Menon, Alka V.; Singh, Christopher Y.; Wilson, Donald A.
2009-01-01
Segmentation of target odorants from background odorants is a fundamental computational requirement for the olfactory system and is thought to be behaviorally mediated by olfactory habituation memory. Data from our laboratory have shown that odor-specific adaptation in piriform neurons, mediated at least partially by synaptic adaptation between…
USSR Report, Cybernetics, Computers and Automation Technology
1985-08-28
alphanumerical displays SM 7206 and SM 7401, graphics displays SM 7300 and SM 7301, modems SM 8105, SM 8107 and SM 8108, and so on. Today it is...3 are written in assembler and PL-1. They require 56K to 200K of memory for operation. SORT-7/SM sorting subsystem was developed in conjunction with
Microprogramming Handbook. Second Edition.
ERIC Educational Resources Information Center
Microdata Corp., Santa Ana, CA.
Instead of instructions residing in the main memory as in a fixed instruction computer, a microprogrammable computer has a separate read-only memory which is alterable so that the system can be efficiently adapted to the application at hand. Microprogrammable computers are faster than fixed instruction computers for several reasons: instruction…
A highly efficient multi-core algorithm for clustering extremely large datasets
2010-01-01
Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
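The shared-memory parallelization described above hinges on splitting the distance/assignment step of k-means across cores while keeping a single copy of the data. A minimal Python sketch of that idea follows; it is not the authors' Java/transactional-memory implementation, and the chunking scheme, worker count and data sizes are illustrative assumptions.

```python
# Minimal sketch of a shared-memory parallel k-means assignment step.
# Illustrative only: chunk sizes, worker count, and data are assumptions.
import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    chunk, centroids = args
    # squared Euclidean distance of every point in the chunk to every centroid
    d2 = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def parallel_kmeans(data, k, n_iter=10, n_workers=4):
    rng = np.random.default_rng(0)
    centroids = data[rng.choice(len(data), k, replace=False)]
    chunks = np.array_split(data, n_workers)
    with Pool(n_workers) as pool:
        for _ in range(n_iter):
            # parallel assignment step over row chunks of the data
            labels = np.concatenate(
                pool.map(assign_chunk, [(c, centroids) for c in chunks]))
            # update step (kept serial here for brevity)
            for j in range(k):
                members = data[labels == j]
                if len(members):
                    centroids[j] = members.mean(axis=0)
    return centroids, labels

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(10000, 20))
    centroids, labels = parallel_kmeans(X, k=5)
    print(centroids.shape, np.bincount(labels))
```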
Memory management in genome-wide association studies
2009-01-01
Genome-wide association is a powerful tool for the identification of genes that underlie common diseases. Genome-wide association studies generate billions of genotypes and pose significant computational challenges for most users including limited computer memory. We applied a recently developed memory management tool to two analyses of North American Rheumatoid Arthritis Consortium studies and measured the performance in terms of central processing unit and memory usage. We conclude that our memory management approach is simple, efficient, and effective for genome-wide association studies. PMID:20018047
Computer memory power control for the Galileo spacecraft
NASA Technical Reports Server (NTRS)
Detwiler, R. C.
1983-01-01
The developmental history, major design drivers, and final topology of the computer memory power system on the Galileo spacecraft are described. A unique method of generating memory backup power directly from the fault current drawn during a spacecraft power overload or fault condition allows this system to provide continuous memory power. This concept provides a unique solution to the problem of volatile memory loss without the use of a battery or other large energy storage elements usually associated with uninterruptible power supply designs.
Density-matrix simulation of small surface codes under current and projected experimental noise
NASA Astrophysics Data System (ADS)
O'Brien, T. E.; Tarasinski, B.; DiCarlo, L.
2017-09-01
We present a density-matrix simulation of the quantum memory and computing performance of the distance-3 logical qubit Surface-17, following a recently proposed quantum circuit and using experimental error parameters for transmon qubits in a planar circuit QED architecture. We use this simulation to optimize components of the QEC scheme (e.g., trading off stabilizer measurement infidelity for reduced cycle time) and to investigate the benefits of feedback harnessing the fundamental asymmetry of relaxation-dominated error in the constituent transmons. A lower-order approximate calculation extends these predictions to the distance-5 Surface-49. These results clearly indicate error rates below the fault-tolerance threshold of the surface code, and the potential for Surface-17 to perform beyond the break-even point of quantum memory. However, Surface-49 is required to surpass the break-even point of computation at state-of-the-art qubit relaxation times and readout speeds.
Sato, Y; Wadamoto, M; Tsuga, K; Teixeira, E R
1999-04-01
Improving the validity of finite element analysis in implant biomechanics requires element downsizing. However, excessive downsizing demands more computer memory and calculation time. To investigate the effectiveness of element downsizing in the construction of a three-dimensional finite element bone trabeculae model, models with different element sizes (600, 300, 150 and 75 microm) were constructed and the stress induced by a vertical 10 N load was analysed. The difference in von Mises stress values between the models with 600 and 300 microm element sizes was larger than that between 300 and 150 microm. On the other hand, no clear difference in stress values was detected among the models with 300, 150 and 75 microm element sizes. Downsizing of elements from 600 to 300 microm is therefore suggested to be effective in the construction of a three-dimensional finite element bone trabeculae model, offering a possible saving of computer memory and calculation time in the laboratory.
Performance Prediction Toolkit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chennupati, Gopinath; Santhi, Nanadakishore; Eidenbenz, Stephen
The Performance Prediction Toolkit (PPT) is a scalable co-design tool that contains hardware and middleware models, which accept proxy applications as input for runtime prediction. PPT relies on Simian, a parallel discrete event simulation engine in Python or Lua, that uses the process concept, where each computing unit (host, node, core) is a Simian entity. Processes perform their task through message exchanges to remain active, sleep, wake up, begin and end. The PPT hardware model of a compute core (such as a Haswell core) consists of a set of parameters, such as clock speed, memory hierarchy levels, their respective sizes, cache lines, access times for different cache levels, average cycle counts of ALU operations, etc. These parameters are ideally read off a spec sheet or are learned using regression models trained on hardware counter (PAPI) data. The compute core model offers an API to the software model, a function called time_compute(), which takes as input a tasklist. A tasklist is an unordered set of ALU and other CPU-type operations (in particular virtual memory loads and stores). The PPT application model mimics the loop structure of the application and replaces the computational kernels with a call to the hardware model's time_compute() function, giving tasklists as input that model the compute kernel. A PPT application model thus consists of tasklists representing kernels and the higher-level loop structure that we like to think of as pseudo code. The key challenge for the hardware model's time_compute() function is to translate virtual memory accesses into actual cache hierarchy level hits and misses. PPT also contains another CPU core level hardware model, the Analytical Memory Model (AMM). The AMM addresses this challenge soundly, whereas our previous alternatives explicitly included the L1, L2, L3 hit rates as inputs to the tasklists. Explicit hit rates inevitably reflect only the application modeler's best guess, perhaps informed by a few small test problems using hardware counters; also, hard-coded hit rates make the hardware model insensitive to changes in cache sizes. Instead, we use reuse distance distributions in the tasklists. In general, reuse profiles require the application modeler to run a very expensive trace analysis on the real code, which realistically can be done at best for small examples.
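The time_compute() interface described above can be made concrete with a toy model: a tasklist of abstract operations is converted into a predicted runtime from per-operation cycle counts and cache-level access times. The sketch below is a simplified illustration, not the actual PPT/Simian code; all parameter names and values are assumptions, and the real AMM derives hit rates from reuse distance distributions rather than taking them as input.

```python
# Toy sketch of a PPT-style hardware model: a tasklist of abstract operations
# is translated into a predicted execution time using per-operation cycle
# counts and cache-level access times. All parameters are illustrative.
CORE = {
    "clock_hz": 2.3e9,
    "cycles_per_op": {"fALU": 1.0, "iALU": 0.5, "div": 20.0},
    "access_time_s": {"L1": 1e-9, "L2": 5e-9, "L3": 20e-9, "DRAM": 80e-9},
}

def time_compute(core, tasklist):
    """tasklist entries: ("fALU", count) for arithmetic, or
    ("load"/"store", count, hit_rates) with hit_rates mapping level -> fraction."""
    seconds = 0.0
    for task in tasklist:
        kind = task[0]
        if kind in ("load", "store"):
            _, count, hit_rates = task
            for level, frac in hit_rates.items():
                seconds += count * frac * core["access_time_s"][level]
        else:
            _, count = task
            seconds += count * core["cycles_per_op"][kind] / core["clock_hz"]
    return seconds

# A kernel model: 1e7 float ops plus 2e6 loads with assumed hit rates.
tasklist = [
    ("fALU", 1e7),
    ("load", 2e6, {"L1": 0.85, "L2": 0.10, "L3": 0.04, "DRAM": 0.01}),
]
print(f"predicted kernel time: {time_compute(CORE, tasklist):.6f} s")
```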
Building logical qubits in a superconducting quantum computing system
NASA Astrophysics Data System (ADS)
Gambetta, Jay M.; Chow, Jerry M.; Steffen, Matthias
2017-01-01
The technological world is in the midst of a quantum computing and quantum information revolution. Since Richard Feynman's famous `plenty of room at the bottom' lecture (Feynman, Engineering and Science 23, 22 (1960)), hinting at the notion of novel devices employing quantum mechanics, the quantum information community has taken gigantic strides in understanding the potential applications of a quantum computer and laid the foundational requirements for building one. We believe that the next significant step will be to demonstrate a quantum memory, in which a system of interacting qubits stores an encoded logical qubit state longer than the incorporated parts. Here, we describe the important route towards a logical memory with superconducting qubits, employing a rotated version of the surface code. The current status of technology with regard to interconnected superconducting-qubit networks will be described and near-term areas of focus to improve devices will be identified. Overall, the progress in this exciting field has been astounding, but we are at an important turning point, where it will be critical to incorporate engineering solutions with quantum architectural considerations, laying the foundation towards scalable fault-tolerant quantum computers in the near future.
Synthetic Analog and Digital Circuits for Cellular Computation and Memory
Purcell, Oliver; Lu, Timothy K.
2014-01-01
Biological computation is a major area of focus in synthetic biology because it has the potential to enable a wide range of applications. Synthetic biologists have applied engineering concepts to biological systems in order to construct progressively more complex gene circuits capable of processing information in living cells. Here, we review the current state of computational genetic circuits and describe artificial gene circuits that perform digital and analog computation. We then discuss recent progress in designing gene circuits that exhibit memory, and how memory and computation have been integrated to yield more complex systems that can both process and record information. Finally, we suggest new directions for engineering biological circuits capable of computation. PMID:24794536
On-chip phase-change photonic memory and computing
NASA Astrophysics Data System (ADS)
Cheng, Zengguang; Ríos, Carlos; Youngblood, Nathan; Wright, C. David; Pernice, Wolfram H. P.; Bhaskaran, Harish
2017-08-01
The use of photonics in computing is a hot topic of interest, driven by the need for ever-increasing speed along with reduced power consumption. In existing computing architectures, photonic data storage would dramatically improve the performance by reducing latencies associated with electrical memories. At the same time, the rise of `big data' and `deep learning' is driving the quest for non-von Neumann and brain-inspired computing paradigms. To succeed in both aspects, we have demonstrated non-volatile multi-level photonic memory avoiding the von Neumann bottleneck in the existing computing paradigm and a photonic synapse resembling the biological synapses for brain-inspired computing using phase-change materials (Ge2Sb2Te5).
GPU-accelerated Modeling and Element-free Reverse-time Migration with Gauss Points Partition
NASA Astrophysics Data System (ADS)
Zhen, Z.; Jia, X.
2014-12-01
Element-free method (EFM) has been applied to seismic modeling and migration. Compared with the finite element method (FEM) and the finite difference method (FDM), it is much cheaper and more flexible because only the information of the nodes and the boundary of the study area is required in the computation. In the EFM, the number of Gauss points should be consistent with the number of model nodes; otherwise the accuracy of the intermediate coefficient matrices would be harmed. Thus, when we increase the nodes of the velocity model in order to obtain higher resolution, the size of the computer's memory becomes a bottleneck: the original EFM can deal with at most 81×81 nodes in the case of 2 GB of memory, as tested by Jia and Hu (2006). In order to solve the problem of storage and computation efficiency, we propose a concept of Gauss points partition (GPP) and utilize GPUs to improve the computation efficiency. Considering the characteristics of the Gauss points, the GPP method does not influence the propagation of the seismic wave in the velocity model. To overcome the time-consuming computation of the stiffness matrix (K) and the mass matrix (M), we also use GPUs in our computation program. We employ the compressed sparse row (CSR) format to compress the intermediate sparse matrices and simplify the operations by solving the linear equations with the CULA Sparse Conjugate Gradient (CG) solver instead of the linear sparse solver PARDISO. It is observed that our strategy can significantly reduce the computational time of K and M compared with the algorithm based on the CPU. The model tested is the Marmousi model; its length is 7425 m and its depth is 2990 m. We discretize the model with 595×298 nodes, 300×300 Gauss cells and 3×3 Gauss points in each cell. In contrast to the computational time of the conventional EFM, the GPUs-GPP approach substantially improves the efficiency: the speedup in computing K and M is 120, and the speedup of the RTM is 11.5. At the same time, the accuracy of imaging is not harmed. Another advantage of the GPUs-GPP method is its easy application in other numerical methods such as the FEM. Finally, in the GPUs-GPP method, the arrays require quite limited memory storage, which makes the method promising for dealing with large-scale 3D problems.
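The storage and solver choices described above, CSR compression of the sparse matrices plus an iterative conjugate-gradient solve in place of a direct sparse factorization, can be illustrated in a few lines. The SciPy sketch below is a CPU-side stand-in for the idea, not the authors' CULA Sparse GPU implementation; the Laplacian test matrix and sizes are assumptions.

```python
# Sketch: assemble a sparse SPD system, store it in CSR, and solve with CG
# instead of a direct factorization. The 2D Laplacian is a stand-in for the
# EFM stiffness/mass matrices; sizes are illustrative.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

n = 200                                  # grid points per dimension
lap1d = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
K = sp.kron(sp.identity(n), lap1d) + sp.kron(lap1d, sp.identity(n))
K = K.tocsr()                            # compressed sparse row storage

b = np.ones(K.shape[0])
x, info = cg(K, b)                       # iterative conjugate-gradient solve
residual = np.linalg.norm(K @ x - b)
print("converged" if info == 0 else f"CG info={info}", residual)
```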
Performance Models for Split-execution Computing Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Humble, Travis S; McCaskey, Alex; Schrock, Jonathan
Split-execution computing leverages the capabilities of multiple computational models to solve problems, but splitting program execution across different computational models incurs costs associated with the translation between domains. We analyze the performance of a split-execution computing system developed from conventional and quantum processing units (QPUs) by using behavioral models that track resource usage. We focus on asymmetric processing models built using conventional CPUs and a family of special-purpose QPUs that employ quantum computing principles. Our performance models account for the translation of a classical optimization problem into the physical representation required by the quantum processor while also accounting for hardware limitations and conventional processor speed and memory. We conclude that the bottleneck in this split-execution computing system lies at the quantum-classical interface and that the primary time cost is independent of quantum processor behavior.
NASA Astrophysics Data System (ADS)
Jaber, Khalid Mohammad; Alia, Osama Moh'd.; Shuaib, Mohammed Mahmod
2018-03-01
Finding the optimal parameters that can reproduce experimental data (such as the velocity-density relation and the specific flow rate) is a very important component of the validation and calibration of microscopic crowd dynamic models. Heavy computational demand during parameter search is a known limitation that exists in a previously developed model known as the Harmony Search-Based Social Force Model (HS-SFM). In this paper, a parallel-based mechanism is proposed to reduce the computational time and memory resource utilisation required to find these parameters. More specifically, two MATLAB-based multicore techniques (parfor and create independent jobs) using shared memory are developed by taking advantage of the multithreading capabilities of parallel computing, resulting in a new framework called the Parallel Harmony Search-Based Social Force Model (P-HS-SFM). The experimental results show that the parfor-based P-HS-SFM achieved a better computational time of about 26 h, an efficiency improvement of approximately 54% and a speedup factor of 2.196 times in comparison with the HS-SFM sequential processor. The performance of the P-HS-SFM using the create independent jobs approach is also comparable to parfor with a computational time of 26.8 h, an efficiency improvement of about 30% and a speedup of 2.137 times.
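A parfor-style parameter search simply maps independent fitness evaluations onto worker processes. The sketch below uses Python's multiprocessing rather than MATLAB, and the objective function is a placeholder for the HS-SFM calibration error; parameter names and ranges are illustrative assumptions.

```python
# Sketch of a parfor-like parallel parameter sweep. The objective function is
# a placeholder for the expensive crowd-model calibration error.
import numpy as np
from multiprocessing import Pool

def calibration_error(params):
    relaxation_time, desired_speed = params
    # stand-in for running the social-force simulation and comparing the
    # simulated velocity-density relation against experimental data
    return (relaxation_time - 0.5) ** 2 + (desired_speed - 1.34) ** 2

def parallel_search(candidates, n_workers=4):
    with Pool(n_workers) as pool:
        errors = pool.map(calibration_error, candidates)   # parfor analogue
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = [(rng.uniform(0.1, 1.0), rng.uniform(0.8, 1.8))
                  for _ in range(1000)]
    best_params, best_err = parallel_search(candidates)
    print(best_params, best_err)
```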
Lu, Xiaofeng; Song, Li; Shen, Sumin; He, Kang; Yu, Songyu; Ling, Nam
2013-01-01
Hough Transform has been widely used for straight line detection in low-definition and still images, but it suffers from execution time and resource requirements. Field Programmable Gate Arrays (FPGA) provide a competitive alternative for hardware acceleration to reap tremendous computing performance. In this paper, we propose a novel parallel Hough Transform (PHT) and FPGA architecture-associated framework for real-time straight line detection in high-definition videos. A resource-optimized Canny edge detection method with enhanced non-maximum suppression conditions is presented to suppress most possible false edges and obtain more accurate candidate edge pixels for subsequent accelerated computation. Then, a novel PHT algorithm exploiting spatial angle-level parallelism is proposed to upgrade computational accuracy by improving the minimum computational step. Moreover, the FPGA based multi-level pipelined PHT architecture optimized by spatial parallelism ensures real-time computation for 1,024 × 768 resolution videos without any off-chip memory consumption. This framework is evaluated on ALTERA DE2-115 FPGA evaluation platform at a maximum frequency of 200 MHz, and it can calculate straight line parameters in 15.59 ms on the average for one frame. Qualitative and quantitative evaluation results have validated the system performance regarding data throughput, memory bandwidth, resource, speed and robustness. PMID:23867746
Lu, Xiaofeng; Song, Li; Shen, Sumin; He, Kang; Yu, Songyu; Ling, Nam
2013-07-17
Hough Transform has been widely used for straight line detection in low-definition and still images, but it suffers from execution time and resource requirements. Field Programmable Gate Arrays (FPGA) provide a competitive alternative for hardware acceleration to reap tremendous computing performance. In this paper, we propose a novel parallel Hough Transform (PHT) and FPGA architecture-associated framework for real-time straight line detection in high-definition videos. A resource-optimized Canny edge detection method with enhanced non-maximum suppression conditions is presented to suppress most possible false edges and obtain more accurate candidate edge pixels for subsequent accelerated computation. Then, a novel PHT algorithm exploiting spatial angle-level parallelism is proposed to upgrade computational accuracy by improving the minimum computational step. Moreover, the FPGA based multi-level pipelined PHT architecture optimized by spatial parallelism ensures real-time computation for 1,024 × 768 resolution videos without any off-chip memory consumption. This framework is evaluated on ALTERA DE2-115 FPGA evaluation platform at a maximum frequency of 200 MHz, and it can calculate straight line parameters in 15.59 ms on the average for one frame. Qualitative and quantitative evaluation results have validated the system performance regarding data throughput, memory bandwidth, resource, speed and robustness.
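The angle-level parallelism exploited by the PHT can be mimicked in software by giving each worker a disjoint set of theta values so that no two workers write to the same accumulator rows. The NumPy sketch below is a software analogue of that idea, not the FPGA pipeline; image size and quantization are illustrative.

```python
# Software sketch of angle-parallel Hough voting: each worker accumulates
# votes for its own subset of theta values, so accumulator writes never clash.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def hough_slice(edge_points, thetas, n_rho, rho_max):
    acc = np.zeros((len(thetas), n_rho), dtype=np.int32)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    ys, xs = edge_points
    for i in range(len(thetas)):
        rho = xs * cos_t[i] + ys * sin_t[i]
        bins = ((rho + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        acc[i] = np.bincount(bins, minlength=n_rho)
    return acc

def parallel_hough(edges, n_theta=180, n_rho=512, n_workers=4):
    ys, xs = np.nonzero(edges)
    rho_max = np.hypot(*edges.shape)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    theta_chunks = np.array_split(thetas, n_workers)
    with ThreadPoolExecutor(n_workers) as ex:
        slices = list(ex.map(
            lambda t: hough_slice((ys, xs), t, n_rho, rho_max), theta_chunks))
    return np.vstack(slices), thetas

if __name__ == "__main__":
    img = np.zeros((256, 256), dtype=bool)
    np.fill_diagonal(img, True)              # a 45-degree line as edge input
    acc, thetas = parallel_hough(img)
    t_idx, r_idx = np.unravel_index(acc.argmax(), acc.shape)
    print("strongest line at theta =", np.degrees(thetas[t_idx]), "deg")
```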
From Three-Photon Greenberger-Horne-Zeilinger States to Ballistic Universal Quantum Computation.
Gimeno-Segovia, Mercedes; Shadbolt, Pete; Browne, Dan E; Rudolph, Terry
2015-07-10
Single photons, manipulated using integrated linear optics, constitute a promising platform for universal quantum computation. A series of increasingly efficient proposals have shown linear-optical quantum computing to be formally scalable. However, existing schemes typically require extensive adaptive switching, which is experimentally challenging and noisy, thousands of photon sources per renormalized qubit, and/or large quantum memories for repeat-until-success strategies. Our work overcomes all these problems. We present a scheme to construct a cluster state universal for quantum computation, which uses no adaptive switching, no large memories, and which is at least an order of magnitude more resource efficient than previous passive schemes. Unlike previous proposals, it is constructed entirely from loss-detecting gates and offers a robustness to photon loss. Even without the use of an active loss-tolerant encoding, our scheme naturally tolerates a total loss rate ∼1.6% in the photons detected in the gates. This scheme uses only three-photon Greenberger-Horne-Zeilinger states as a resource, together with a passive linear-optical network. We fully describe and model the iterative process of cluster generation, including photon loss and gate failure. This demonstrates that building a linear-optical quantum computer need be less challenging than previously thought.
Roads towards fault-tolerant universal quantum computation
NASA Astrophysics Data System (ADS)
Campbell, Earl T.; Terhal, Barbara M.; Vuillot, Christophe
2017-09-01
A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.
Roads towards fault-tolerant universal quantum computation.
Campbell, Earl T; Terhal, Barbara M; Vuillot, Christophe
2017-09-13
A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.
Machine vision for real time orbital operations
NASA Technical Reports Server (NTRS)
Vinz, Frank L.
1988-01-01
Machine vision for automation and robotic operation of Space Station era systems has the potential for increasing the efficiency of orbital servicing, repair, assembly and docking tasks. A machine vision research project is described in which a TV camera is used for inputting visual data to a computer so that image processing may be achieved for real-time control of these orbital operations. A technique has resulted from this research which reduces computer memory requirements and greatly increases typical computational speed such that it has the potential for development into a real-time orbital machine vision system. This technique is called AI BOSS (Analysis of Images by Box Scan and Syntax).
GPU-based Parallel Application Design for Emerging Mobile Devices
NASA Astrophysics Data System (ADS)
Gupta, Kshitij
A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as compute and communication capabilities of mobile devices improve, we analyze energy implications of processing speech recognition locally (on-chip) and offloading it to servers (in-cloud).
NASA Technical Reports Server (NTRS)
Witkop, D. L.; Dale, B. J.; Gellin, S.
1991-01-01
The programming aspects of SFENES are described in the User's Manual. The information presented is provided for the installation programmer and is sufficient to fully describe the general program logic and required peripheral storage. All element-generated data is stored externally to reduce the required memory allocation. A separate section is devoted to the description of these files, thereby permitting the optimization of input/output (I/O) time through efficient buffer descriptions. Individual subroutine descriptions are presented along with the complete Fortran source listings. A short description of the major control, computation, and I/O phases is included to aid in obtaining an overall familiarity with the program's components. Finally, a discussion of the suggested overlay structure, which allows the program to execute with a reasonable amount of memory allocation, is presented.
A parallel implementation of a multisensor feature-based range-estimation method
NASA Technical Reports Server (NTRS)
Suorsa, Raymond E.; Sridhar, Banavar
1993-01-01
There are many proposed vision based methods to perform obstacle detection and avoidance for autonomous or semi-autonomous vehicles. All methods, however, will require very high processing rates to achieve real time performance. A system capable of supporting autonomous helicopter navigation will need to extract obstacle information from imagery at rates varying from ten frames per second to thirty or more frames per second depending on the vehicle speed. Such a system will need to sustain billions of operations per second. To reach such high processing rates using current technology, a parallel implementation of the obstacle detection/ranging method is required. This paper describes an efficient and flexible parallel implementation of a multisensor feature-based range-estimation algorithm, targeted for helicopter flight, realized on both a distributed-memory and shared-memory parallel computer.
Scalability improvements to NRLMOL for DFT calculations of large molecules
NASA Astrophysics Data System (ADS)
Diaz, Carlos Manuel
Advances in high performance computing (HPC) have provided a way to treat large, computationally demanding tasks using thousands of processors. With the development of more powerful HPC architectures, the need to create efficient and scalable code has grown more important. Electronic structure calculations are valuable in understanding experimental observations and are routinely used for new materials predictions. For electronic structure calculations, the memory and computation time grow with the number of atoms; memory requirements for these calculations scale as N², where N is the number of atoms. While the recent advances in HPC offer platforms with large numbers of cores, the limited amount of memory available on a given node and poor scalability of the electronic structure code hinder the efficient usage of these platforms. This thesis presents developments to overcome these bottlenecks in order to study large systems. These developments, which are implemented in the NRLMOL electronic structure code, involve the use of sparse matrix storage formats and linear algebra on sparse and distributed matrices. These developments, along with other related work, now allow ground state density functional calculations using up to 25,000 basis functions and excited state calculations using up to 17,000 basis functions while utilizing all cores on a node. An example on a light-harvesting triad molecule is described. Finally, future plans to further improve the scalability are presented.
NASA Astrophysics Data System (ADS)
Cheng, Tian-Le; Ma, Fengde D.; Zhou, Jie E.; Jennings, Guy; Ren, Yang; Jin, Yongmei M.; Wang, Yu U.
2012-01-01
Diffuse scattering contains rich information on various structural disorders, thus providing a useful means to study the nanoscale structural deviations from the average crystal structures determined by Bragg peak analysis. Extraction of maximal information from diffuse scattering requires concerted efforts in high-quality three-dimensional (3D) data measurement, quantitative data analysis and visualization, theoretical interpretation, and computer simulations. Such an endeavor is undertaken to study the correlated dynamic atomic position fluctuations caused by thermal vibrations (phonons) in precursor state of shape-memory alloys. High-quality 3D diffuse scattering intensity data around representative Bragg peaks are collected by using in situ high-energy synchrotron x-ray diffraction and two-dimensional digital x-ray detector (image plate). Computational algorithms and codes are developed to construct the 3D reciprocal-space map of diffuse scattering intensity distribution from the measured data, which are further visualized and quantitatively analyzed to reveal in situ physical behaviors. Diffuse scattering intensity distribution is explicitly formulated in terms of atomic position fluctuations to interpret the experimental observations and identify the most relevant physical mechanisms, which help set up reduced structural models with minimal parameters to be efficiently determined by computer simulations. Such combined procedures are demonstrated by a study of phonon softening phenomenon in precursor state and premartensitic transformation of Ni-Mn-Ga shape-memory alloy.
Implementing Molecular Dynamics for Hybrid High Performance Computers - 1. Short Range Forces
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, W Michael; Wang, Peng; Plimpton, Steven J
The use of accelerators such as general-purpose graphics processing units (GPGPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines: 1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, 2) minimizing the amount of code that must be ported for efficient acceleration, 3) utilizing the available processing power from both many-core CPUs and accelerators, and 4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS. We describe algorithms for efficient short range force calculation on hybrid high performance machines. We describe a new approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPGPUs and 180 CPU cores.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Romanov, E.S.; Ivoilov, N.G.
A buffer memory unit and an interface for the UNO-4096-90 accumulator with an Elektronika D3-28 microcomputer are described that allow simultaneous recording of four Moessbauer spectra with zero dead time. For complete elimination of dead time, the pulses from each detector are fed to two buffer counter units, which operate alternately in the write and interrogate modes. This organization of the buffer memory also completely eliminates the effect of the sensors on one another. The use of these circuits does not require any modifications of the computer or the accumulator.
Parallel discrete event simulation using shared memory
NASA Technical Reports Server (NTRS)
Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.
1988-01-01
With traditional event-list techniques, evaluating a detailed discrete-event simulation model can often require hours or even days of computation time. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared-memory experiments using the Chandy-Misra distributed-simulation algorithm to simulate networks of queues is presented. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.
NASA Technical Reports Server (NTRS)
Smith, W. W.
1973-01-01
A Langley Research Center version of NASTRAN Level 15.1.0 designed to provide the analyst with an added tool for debugging massive NASTRAN input data is described. The program checks all NASTRAN input data cards and displays on a CRT the graphic representation of the undeformed structure. In addition, the program permits the display and alteration of input data and allows reexecution without physically resubmitting the job. Core requirements on the CDC 6000 computer are approximately 77,000 octal words of central memory.
System and method for programmable bank selection for banked memory subsystems
Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Hoenicke, Dirk; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan
2010-09-07
A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each of the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures.
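In software terms, the bank selection described above amounts to extracting programmable bit fields from a physical address and matching them against configured patterns to pick a memory storage structure. The following sketch only illustrates that logic; the bit positions and patterns are arbitrary examples, not the patented hardware design.

```python
# Illustrative model of programmable bank selection: a per-processor "first
# logic device" compares selected physical-address bits against programmed
# values, and a "second logic device" turns the match into a bank index.
# Bit positions and patterns below are arbitrary examples.

def make_selector(bit_positions, patterns):
    """patterns maps a tuple of bit values (at bit_positions) -> bank id."""
    def select(phys_addr):
        bits = tuple((phys_addr >> b) & 1 for b in bit_positions)
        return patterns.get(bits)          # None means "no banked match"
    return select

# Example: use address bits 12 and 13 to spread accesses over four banks.
selector = make_selector(
    bit_positions=(12, 13),
    patterns={(0, 0): 0, (1, 0): 1, (0, 1): 2, (1, 1): 3},
)

for addr in (0x0000, 0x1000, 0x2000, 0x3000, 0x4000):
    print(hex(addr), "-> bank", selector(addr))
```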
NASA Technical Reports Server (NTRS)
Reuther, James; Alonso, Juan Jose; Rimlinger, Mark J.; Jameson, Antony
1996-01-01
This work describes the application of a control theory-based aerodynamic shape optimization method to the problem of supersonic aircraft design. The design process is greatly accelerated through the use of both control theory and a parallel implementation on distributed memory computers. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods (13, 12, 44, 38). The resulting problem is then implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) Standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on higher order computational fluid dynamics (CFD) methods. In our earlier studies, the serial implementation of this design method (19, 20, 21, 23, 39, 25, 40, 41, 42, 43, 9) was shown to be effective for the optimization of airfoils, wings, wing-bodies, and complex aircraft configurations using both the potential equation and the Euler equations (39, 25). In our most recent paper, the Euler method was extended to treat complete aircraft configurations via a new multiblock implementation. Furthermore, during the same conference, we also presented preliminary results demonstrating that the basic methodology could be ported to distributed memory parallel computing architectures [24]. In this paper, our concern will be to demonstrate that the combined power of these new technologies can be used routinely in an industrial design environment by applying it to the case study of the design of typical supersonic transport configurations. A particular difficulty of this test case is posed by the propulsion/airframe integration.
Chan, Raymond C K; Wang, Ya; Ma, Zheng; Hong, Xiao-hong; Yuan, Yanbo; Yu, Xin; Li, Zhanjiang; Shum, David; Gong, Qi-yong
2008-08-01
While a number of studies have shown that individuals with schizophrenia are impaired on various types of prospective memory, few studies have examined the relationship between subjective and objective measures of this construct in this clinical group. The purpose of the current study was to explore the relationship between computer-based prospective memory tasks and the corresponding subjective complaints in patients with schizophrenia, individuals with schizotypal personality features, and healthy volunteers. The findings showed that patients with schizophrenia demonstrated significantly poorer performance in all domains of memory function except visual memory than individuals with schizotypal personality disorder and healthy controls. More importantly, there was a significant interaction effect of prospective memory type and group. Although patients with schizophrenia were found to show significantly poorer performance on computer-based measures of prospective memory than controls, their level of subjective complaint was not found to be significantly higher. While subjective complaints of prospective memory were found to associate significantly with self-reported executive dysfunctions, significant relationships were not found between these complaints and performance on a computer-based task of prospective memory and other objective measures of memory. Taken together, these findings suggest that subjective and objective measures of prospective memory are two distinct domains that might need to be assessed and addressed separately.
SIproc: an open-source biomedical data processing platform for large hyperspectral images.
Berisha, Sebastian; Chang, Shengyuan; Saki, Sam; Daeinejad, Davar; He, Ziqi; Mankar, Rupali; Mayerich, David
2017-04-10
There has recently been significant interest within the vibrational spectroscopy community to apply quantitative spectroscopic imaging techniques to histology and clinical diagnosis. However, many of the proposed methods require collecting spectroscopic images that have a similar region size and resolution to the corresponding histological images. Since spectroscopic images contain significantly more spectral samples than traditional histology, the resulting data sets can approach hundreds of gigabytes to terabytes in size. This makes them difficult to store and process, and the tools available to researchers for handling large spectroscopic data sets are limited. Fundamental mathematical tools, such as MATLAB, Octave, and SciPy, are extremely powerful but require that the data be stored in fast memory. This memory limitation becomes impractical for even modestly sized histological images, which can be hundreds of gigabytes in size. In this paper, we propose an open-source toolkit designed to perform out-of-core processing of hyperspectral images. By taking advantage of graphical processing unit (GPU) computing combined with adaptive data streaming, our software alleviates common workstation memory limitations while achieving better performance than existing applications.
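The out-of-core pattern described here, streaming manageable blocks of a large hyperspectral cube from disk, processing each block (on a GPU in SIproc) and writing the results back, can be sketched in a few lines. The NumPy memmap version below is a CPU-only illustration with made-up file names, shapes and a placeholder per-block operation; it is not the SIproc code.

```python
# Out-of-core sketch: process a hyperspectral cube (pixels x bands) that may
# not fit in RAM by streaming fixed-size pixel blocks through memory.
# File names, shapes, and the per-block operation are illustrative.
import numpy as np

n_pixels, n_bands, block = 100_000, 128, 10_000

# Disk-backed cube (stand-in for a measured hyperspectral image).
cube = np.memmap("cube.dat", dtype=np.float32, mode="w+",
                 shape=(n_pixels, n_bands))
out = np.memmap("normalized.dat", dtype=np.float32, mode="w+",
                shape=(n_pixels, n_bands))

for start in range(0, n_pixels, block):
    stop = min(start + block, n_pixels)
    chunk = np.asarray(cube[start:stop])          # load one block into RAM
    # per-spectrum normalization as a placeholder for the real processing
    norms = np.linalg.norm(chunk, axis=1, keepdims=True)
    out[start:stop] = chunk / np.maximum(norms, 1e-12)

out.flush()
print("processed", n_pixels, "spectra in blocks of", block)
```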
A Linked List-Based Algorithm for Blob Detection on Embedded Vision-Based Sensors
Acevedo-Avila, Ricardo; Gonzalez-Mendoza, Miguel; Garcia-Garcia, Andres
2016-01-01
Blob detection is a common task in vision-based applications. Most existing algorithms are aimed at execution on general purpose computers; while very few can be adapted to the computing restrictions present in embedded platforms. This paper focuses on the design of an algorithm capable of real-time blob detection that minimizes system memory consumption. The proposed algorithm detects objects in one image scan; it is based on a linked-list data structure tree used to label blobs depending on their shape and node information. An example application showing the results of a blob detection co-processor has been built on a low-powered field programmable gate array hardware as a step towards developing a smart video surveillance system. The detection method is intended for general purpose application. As such, several test cases focused on character recognition are also examined. The results obtained present a fair trade-off between accuracy and memory requirements; and prove the validity of the proposed approach for real-time implementation on resource-constrained computing platforms. PMID:27240382
Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F
2002-02-01
This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
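The parallelization strategy, statically partitioning photon histories across ranks, giving each rank its own random stream and reducing the tallies at the end, is captured by the mpi4py sketch below. The photon "physics" is a trivial placeholder rather than SIMIND transport, and seeding separate generators per rank is only a stand-in for the SPRNG streams used in the paper.

```python
# Sketch of history-partitioned Monte Carlo with MPI: each rank simulates an
# equal share of photon histories with its own random stream and the partial
# tallies are summed on rank 0.
# Run with e.g.:  mpiexec -n 4 python mc_partition.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_total = 1_000_000
n_local = n_total // size + (1 if rank < n_total % size else 0)

# Independent per-rank stream (a simple stand-in for SPRNG decorrelation).
rng = np.random.default_rng(12345 + rank)

# Placeholder "detector": tally photons whose sampled path length exceeds 1.
path_lengths = rng.exponential(scale=1.0, size=n_local)
local_tally = np.array([np.count_nonzero(path_lengths > 1.0)], dtype=np.int64)

global_tally = np.zeros_like(local_tally)
comm.Reduce(local_tally, global_tally, op=MPI.SUM, root=0)

if rank == 0:
    print("detected fraction:", global_tally[0] / n_total)
```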
NASA Astrophysics Data System (ADS)
Aigner, M.; Köpplmayr, T.; Kneidinger, C.; Miethlinger, J.
2014-05-01
Barrier screws are widely used in the plastics industry. Due to the extreme diversity of their geometries, describing the flow behavior is difficult and rarely done in practice. We present a systematic approach based on networks that uses tensor algebra and numerical methods to model and calculate selected barrier screw geometries in terms of pressure, mass flow, and residence time. In addition, we report the results of three-dimensional simulations using the commercially available ANSYS Polyflow software. The major drawbacks of three-dimensional finite-element-method (FEM) simulations are that they require vast computational power and large quantities of memory, and that considerable time is needed to create a geometric model by computer-aided design (CAD) and complete a flow calculation. Consequently, a modified 2.5-dimensional finite volume method, termed network analysis, is preferable. The results obtained by network analysis and FEM simulations correlated well. Network analysis provides an efficient alternative to complex FEM software in terms of computing power and memory consumption. Furthermore, typical barrier screw geometries can be parameterized and used for flow calculations without time-consuming CAD constructions.
NASA Astrophysics Data System (ADS)
Erez, Mattan; Dally, William J.
Stream processors, like other multi-core architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities, with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control hardware, consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
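The gather-compute-scatter form mentioned above separates irregular memory access from the arithmetic kernel. A NumPy rendering of that structure, purely illustrative and not a stream-processor program, looks like this:

```python
# Gather-compute-scatter sketch: indices are resolved up front (gather), the
# kernel then runs on dense local data, and results are written back
# (scatter). Sizes and the kernel itself are illustrative.
import numpy as np

rng = np.random.default_rng(0)
memory = rng.normal(size=1_000_000)          # "off-chip" data
indices = rng.integers(0, memory.size, size=4096)

# Gather: bring the needed elements into a dense, contiguous working set.
local = memory[indices]

# Compute: a regular, data-parallel kernel over the local working set.
local = np.tanh(2.0 * local) + 0.5

# Scatter: write the results back to their home locations
# (with duplicate indices, the last write wins in this simple sketch).
memory[indices] = local

print("updated", indices.size, "elements")
```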
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Phipps, Eric T.; D'Elia, Marta; Edwards, Harold C.; ...
2017-04-18
In this study, quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan).
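The essence of embedded ensemble propagation is to make the sample index an innermost, vectorized dimension so that a single solver pass advances a whole group of samples. The NumPy sketch below shows the idea on an explicit 1D heat-equation solve with an ensemble of diffusion coefficients; the PDE, sizes and coefficients are illustrative assumptions, and the actual implementation uses C++ templates within Trilinos.

```python
# Sketch of embedded ensemble propagation: an explicit 1D heat-equation
# update carried out for an ensemble of diffusion coefficients at once, with
# the ensemble as a trailing vectorized axis. Periodic boundaries and all
# problem sizes are illustrative.
import numpy as np

n_cells, n_samples, n_steps = 256, 32, 500
dx, dt = 1.0 / n_cells, 2e-6          # dt chosen to keep the scheme stable

rng = np.random.default_rng(0)
kappa = rng.uniform(0.5, 1.5, size=n_samples)       # uncertain coefficients

u = np.zeros((n_cells, n_samples))
u[n_cells // 2, :] = 1.0 / dx                       # same initial spike for all

for _ in range(n_steps):
    lap = (np.roll(u, 1, axis=0) - 2 * u + np.roll(u, -1, axis=0)) / dx**2
    u = u + dt * kappa * lap                        # one pass updates all samples

# Quantity of interest and its ensemble statistics.
qoi = u[n_cells // 4, :]
print("mean:", qoi.mean(), "std:", qoi.std())
```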
Computational modelling of memory retention from synapse to behaviour
NASA Astrophysics Data System (ADS)
van Rossum, Mark C. W.; Shippi, Maria
2013-03-01
One of our most intriguing mental abilities is the capacity to store information and recall it from memory. Computational neuroscience has been influential in developing models and concepts of learning and memory. In this tutorial review we focus on the interplay between learning and forgetting. We discuss recent advances in the computational description of the learning and forgetting processes on synaptic, neuronal, and systems levels, as well as recent data that open up new challenges for statistical physicists.
Design and Implementation of an MC68020-Based Educational Computer Board
1989-12-01
device and the other for a Macintosh personal computer. A stored program can be installed in 8K bytes Programmable Read Only Memory (PROM) to initialize...MHz. It includes four Static Random Access Memory (SRAM) chips which provide a storage of 32K bytes. Two Programmable Array Logic (PAL) chips...
Correlation energy extrapolation by many-body expansion
Boschen, Jeffery S.; Theis, Daniel; Ruedenberg, Klaus; ...
2017-01-09
Accounting for electron correlation is required for high accuracy calculations of molecular energies. The full configuration interaction (CI) approach can fully capture the electron correlation within a given basis, but it does so at a computational expense that is impractical for all but the smallest chemical systems. In this work, a new methodology is presented to approximate configuration interaction calculations at a reduced computational expense and memory requirement, namely, the correlation energy extrapolation by many-body expansion (CEEMBE). This method combines a MBE approximation of the CI energy with an extrapolated correction obtained from CI calculations using subsets of the virtual orbitals. The extrapolation approach is inspired by, and analogous to, the method of correlation energy extrapolation by intrinsic scaling. Benchmark calculations of the new method are performed on diatomic fluorine and ozone. Finally, the method consistently achieves agreement with CI calculations to within a few mhartree and often achieves agreement to within ~1 millihartree or less, while requiring significantly less computational resources.
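As a reading aid, the snippet below shows only the inclusion-exclusion bookkeeping of a 2-body many-body expansion over orbital subsets, with made-up energies; the CEEMBE extrapolation correction itself is not reproduced here.

    # Minimal many-body expansion bookkeeping (hypothetical energies; the CEEMBE
    # extrapolation step itself is not reproduced here).
    from itertools import combinations

    # hypothetical correlation energies (hartree) of CI calculations on orbital subsets
    E1 = {"A": -0.10, "B": -0.12, "C": -0.09}                       # one "body" at a time
    E2 = {("A", "B"): -0.25, ("A", "C"): -0.21, ("B", "C"): -0.23}  # pairs

    # 2-body MBE estimate: sum of 1-body terms plus pairwise corrections
    bodies = sorted(E1)
    e_mbe = sum(E1.values())
    for i, j in combinations(bodies, 2):
        e_mbe += E2[(i, j)] - E1[i] - E1[j]                         # pair correction

    print(f"2-body MBE estimate of correlation energy: {e_mbe:.3f} hartree")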
Correlation energy extrapolation by many-body expansion
DOE Office of Scientific and Technical Information (OSTI.GOV)
Boschen, Jeffery S.; Theis, Daniel; Ruedenberg, Klaus
Accounting for electron correlation is required for high accuracy calculations of molecular energies. The full configuration interaction (CI) approach can fully capture the electron correlation within a given basis, but it does so at a computational expense that is impractical for all but the smallest chemical systems. In this work, a new methodology is presented to approximate configuration interaction calculations at a reduced computational expense and memory requirement, namely, the correlation energy extrapolation by many-body expansion (CEEMBE). This method combines a MBE approximation of the CI energy with an extrapolated correction obtained from CI calculations using subsets of the virtual orbitals. The extrapolation approach is inspired by, and analogous to, the method of correlation energy extrapolation by intrinsic scaling. Benchmark calculations of the new method are performed on diatomic fluorine and ozone. Finally, the method consistently achieves agreement with CI calculations to within a few mhartree and often achieves agreement to within ~1 millihartree or less, while requiring significantly less computational resources.
Kinetic energy classification and smoothing for compact B-spline basis sets in quantum Monte Carlo
Krogel, Jaron T.; Reboredo, Fernando A.
2018-01-25
Quantum Monte Carlo calculations of defect properties of transition metal oxides have become feasible in recent years due to increases in computing power. As the system size has grown, availability of on-node memory has become a limiting factor. Saving memory while minimizing computational cost is now a priority. The main growth in memory demand stems from the B-spline representation of the single particle orbitals, especially for heavier elements such as transition metals where semi-core states are present. Despite the associated memory costs, splines are computationally efficient. In this paper, we explore alternatives to reduce the memory usage of splined orbitals without significantly affecting numerical fidelity or computational efficiency. We make use of the kinetic energy operator to both classify and smooth the occupied set of orbitals prior to splining. By using a partitioning scheme based on the per-orbital kinetic energy distributions, we show that memory savings of about 50% is possible for select transition metal oxide systems. Finally, for production supercells of practical interest, our scheme incurs a performance penalty of less than 5%.
Kinetic energy classification and smoothing for compact B-spline basis sets in quantum Monte Carlo
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krogel, Jaron T.; Reboredo, Fernando A.
Quantum Monte Carlo calculations of defect properties of transition metal oxides have become feasible in recent years due to increases in computing power. As the system size has grown, availability of on-node memory has become a limiting factor. Saving memory while minimizing computational cost is now a priority. The main growth in memory demand stems from the B-spline representation of the single particle orbitals, especially for heavier elements such as transition metals where semi-core states are present. Despite the associated memory costs, splines are computationally efficient. In this paper, we explore alternatives to reduce the memory usage of splined orbitals without significantly affecting numerical fidelity or computational efficiency. We make use of the kinetic energy operator to both classify and smooth the occupied set of orbitals prior to splining. By using a partitioning scheme based on the per-orbital kinetic energy distributions, we show that memory savings of about 50% is possible for select transition metal oxide systems. Finally, for production supercells of practical interest, our scheme incurs a performance penalty of less than 5%.
Kinetic energy classification and smoothing for compact B-spline basis sets in quantum Monte Carlo
NASA Astrophysics Data System (ADS)
Krogel, Jaron T.; Reboredo, Fernando A.
2018-01-01
Quantum Monte Carlo calculations of defect properties of transition metal oxides have become feasible in recent years due to increases in computing power. As the system size has grown, availability of on-node memory has become a limiting factor. Saving memory while minimizing computational cost is now a priority. The main growth in memory demand stems from the B-spline representation of the single particle orbitals, especially for heavier elements such as transition metals where semi-core states are present. Despite the associated memory costs, splines are computationally efficient. In this work, we explore alternatives to reduce the memory usage of splined orbitals without significantly affecting numerical fidelity or computational efficiency. We make use of the kinetic energy operator to both classify and smooth the occupied set of orbitals prior to splining. By using a partitioning scheme based on the per-orbital kinetic energy distributions, we show that memory savings of about 50% is possible for select transition metal oxide systems. For production supercells of practical interest, our scheme incurs a performance penalty of less than 5%.
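A schematic version of the partitioning idea, with hypothetical per-orbital kinetic energies, threshold, and grid spacings (not the actual QMC implementation): orbitals with high kinetic energy stay on a fine B-spline grid, the rest move to a coarser one, and the relative memory footprint is estimated from the grid-spacing scaling.

    # Sketch of kinetic-energy-based orbital partitioning for spline storage
    # (hypothetical data and threshold; not the production QMC implementation).
    import numpy as np

    rng = np.random.default_rng(1)
    n_orb = 200
    kinetic = np.abs(rng.normal(loc=5.0, scale=4.0, size=n_orb))   # per-orbital <T> in Ha (made up)

    threshold = 8.0                          # hypothetical cutoff separating semi-core-like orbitals
    sharp = kinetic > threshold              # high kinetic energy -> rapidly varying -> fine grid
    smooth = ~sharp

    fine_spacing, coarse_spacing = 0.05, 0.10    # hypothetical B-spline grid spacings (bohr)
    # memory of a 3D B-spline table scales roughly like spacing**-3 per orbital
    mem_uniform = n_orb * fine_spacing**-3
    mem_split = sharp.sum() * fine_spacing**-3 + smooth.sum() * coarse_spacing**-3
    print(f"relative memory with split grids: {mem_split / mem_uniform:.2f}")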
Aberg, Kristoffer C; Müller, Julia; Schwartz, Sophie
2017-01-01
Anticipation and delivery of rewards improves memory formation, but little effort has been made to disentangle their respective contributions to memory enhancement. Moreover, it has been suggested that the effects of reward on memory are mediated by dopaminergic influences on hippocampal plasticity. Yet, evidence linking memory improvements to actual reward computations reflected in the activity of the dopaminergic system, i.e., prediction errors and expected values, is scarce and inconclusive. For example, different previous studies reported that the magnitude of prediction errors during a reinforcement learning task was a positive, negative, or non-significant predictor of successfully encoding simultaneously presented images. Individual sensitivities to reward and punishment have been found to influence the activation of the dopaminergic reward system and could therefore help explain these seemingly discrepant results. Here, we used a novel associative memory task combined with computational modeling and showed independent effects of reward-delivery and reward-anticipation on memory. Strikingly, the computational approach revealed positive influences from both reward delivery, as mediated by prediction error magnitude, and reward anticipation, as mediated by magnitude of expected value, even in the absence of behavioral effects when analyzed using standard methods, i.e., by collapsing memory performance across trials within conditions. We additionally measured trait estimates of reward and punishment sensitivity and found that individuals with increased reward (vs. punishment) sensitivity had better memory for associations encoded during positive (vs. negative) prediction errors when tested after 20 min, but a negative trend when tested after 24 h. In conclusion, modeling trial-by-trial fluctuations in the magnitude of reward, as we did here for prediction errors and expected value computations, provides a comprehensive and biologically plausible description of the dynamic interplay between reward, dopamine, and associative memory formation. Our results also underline the importance of considering individual traits when assessing reward-related influences on memory.
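The kind of trial-by-trial model referred to here can be sketched with a simple delta-rule learner whose expected values and prediction errors then serve as regressors for memory encoding; the learning rate and simulated outcomes below are illustrative, not the authors' fitted parameters.

    # Rescorla-Wagner-style trial-by-trial model producing expected values and
    # prediction errors usable as memory-encoding regressors (illustrative parameters).
    import numpy as np

    rng = np.random.default_rng(2)
    n_trials, alpha = 100, 0.2
    rewards = rng.integers(0, 2, size=n_trials).astype(float)   # hypothetical outcomes

    V = 0.0
    expected_value, prediction_error = [], []
    for r in rewards:
        expected_value.append(V)
        delta = r - V                 # reward prediction error at delivery
        prediction_error.append(delta)
        V += alpha * delta            # update the expectation for the next trial

    # e.g., later regress trial-wise memory accuracy on |delta| and on V
    print(np.mean(np.abs(prediction_error)), np.mean(expected_value))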
Opportunities for nonvolatile memory systems in extreme-scale high-performance computing
Vetter, Jeffrey S.; Mittal, Sparsh
2015-01-12
For extreme-scale high-performance computing systems, system-wide power consumption has been identified as one of the key constraints moving forward, where DRAM main memory systems account for about 30 to 50 percent of a node's overall power consumption. As the benefits of device scaling for DRAM memory slow, it will become increasingly difficult to keep memory capacities balanced with increasing computational rates offered by next-generation processors. However, several emerging memory technologies related to nonvolatile memory (NVM) devices are being investigated as an alternative for DRAM. Moving forward, NVM devices could offer solutions for HPC architectures. Researchers are investigating how to integrate these emerging technologies into future extreme-scale HPC systems and how to expose these capabilities in the software stack and applications. In addition, current results show several of these strategies could offer high-bandwidth I/O, larger main memory capacities, persistent data structures, and new approaches for application resilience and output postprocessing, such as transaction-based incremental checkpointing and in situ visualization, respectively.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Shuangshuang; Chen, Yousu; Wu, Di
2015-12-09
Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computationally intensive and challenging to solve using a single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on a shared-memory platform, and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.
A Neural Network Architecture For Rapid Model Indexing In Computer Vision Systems
NASA Astrophysics Data System (ADS)
Pawlicki, Ted
1988-03-01
Models of objects stored in memory have been shown to be useful for guiding the processing of computer vision systems. A major consideration in such systems, however, is how stored models are initially accessed and indexed by the system. As the number of stored models increases, the time required to search memory for the correct model becomes high. Parallel distributed, connectionist, neural networks have been shown to have appealing content addressable memory properties. This paper discusses an architecture for efficient storage and reference of model memories stored as stable patterns of activity in a parallel, distributed, connectionist, neural network. The emergent properties of content addressability and resistance to noise are exploited to perform indexing of the appropriate object centered model from image centered primitives. The system consists of three network modules, each of which represents information relative to a different frame of reference. The model memory network is a large state space vector where fields in the vector correspond to ordered component objects and relative, object based spatial relationships between the component objects. The component assertion network represents evidence about the existence of object primitives in the input image. It establishes local frames of reference for object primitives relative to the image based frame of reference. The spatial relationship constraint network is an intermediate representation which enables the association between the object based and the image based frames of reference. This intermediate level represents information about possible object orderings and establishes relative spatial relationships from the image based information in the component assertion network below. It is also constrained by the lawful object orderings in the model memory network above. The system design is consistent with current psychological theories of recognition by component. It also seems to support Marr's notions of hierarchical indexing (i.e., the specificity, adjunct, and parent indices). It supports the notion that multiple canonical views of an object may have to be stored in memory to enable its efficient identification. The use of variable fields in the state space vectors appears to keep the number of required nodes in the network down to a tractable number while imposing a semantic value on different areas of the state space. This semantic imposition supports an interface between the analogical aspects of neural networks and the propositional paradigms of symbolic processing.
Sparse distributed memory and related models
NASA Technical Reports Server (NTRS)
Kanerva, Pentti
1992-01-01
Described here is sparse distributed memory (SDM) as a neural-net associative memory. It is characterized by two weight matrices and by a large internal dimension - the number of hidden units is much larger than the number of input or output units. The first matrix, A, is fixed and possibly random, and the second matrix, C, is modifiable. The SDM is compared and contrasted to (1) computer memory, (2) correlation-matrix memory, (3) feed-forward artificial neural network, (4) cortex of the cerebellum, (5) Marr and Albus models of the cerebellum, and (6) Albus' cerebellar model arithmetic computer (CMAC). Several variations of the basic SDM design are discussed: the selected-coordinate and hyperplane designs of Jaeckel, the pseudorandom associative neural memory of Hassoun, and SDM with real-valued input variables by Prager and Fallside. SDM research conducted mainly at the Research Institute for Advanced Computer Science (RIACS) in 1986-1991 is highlighted.
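A minimal NumPy sketch of the two-matrix SDM design described above (fixed random address matrix A, modifiable content matrix C); the word length, number of hard locations, and activation radius are illustrative choices, not values from the report.

    # Minimal sparse distributed memory sketch: fixed random address matrix A,
    # modifiable content matrix C (illustrative parameters).
    import numpy as np

    rng = np.random.default_rng(3)
    n_bits, n_hard, radius = 256, 2000, 111      # word length, hard locations, Hamming radius

    A = rng.integers(0, 2, size=(n_hard, n_bits))       # fixed, random hard addresses
    C = np.zeros((n_hard, n_bits), dtype=int)           # modifiable counters

    def activate(addr):
        return np.count_nonzero(A != addr, axis=1) <= radius   # locations within the radius

    def write(addr, data):
        C[activate(addr)] += 2 * data - 1                # add +1/-1 per bit at active locations

    def read(addr):
        return (C[activate(addr)].sum(axis=0) > 0).astype(int)

    word = rng.integers(0, 2, size=n_bits)
    write(word, word)                                    # autoassociative store
    noisy = word.copy(); noisy[:20] ^= 1                 # flip 20 bits of the cue
    print("bits recovered:", np.count_nonzero(read(noisy) == word))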
Computation of Sensitivity Derivatives of Navier-Stokes Equations using Complex Variables
NASA Technical Reports Server (NTRS)
Vatsa, Veer N.
2004-01-01
Accurate computation of sensitivity derivatives is becoming an important item in Computational Fluid Dynamics (CFD) because of recent emphasis on using nonlinear CFD methods in aerodynamic design, optimization, stability and control related problems. Several techniques are available to compute gradients or sensitivity derivatives of desired flow quantities or cost functions with respect to selected independent (design) variables. Perhaps the most common and oldest method is to use straightforward finite-differences for the evaluation of sensitivity derivatives. Although very simple, this method is prone to errors associated with choice of step sizes and can be cumbersome for geometric variables. The cost per design variable for computing sensitivity derivatives with central differencing is at least equal to the cost of three full analyses, but is usually much larger in practice due to difficulty in choosing step sizes. Another approach gaining popularity is the use of Automatic Differentiation software (such as ADIFOR) to process the source code, which in turn can be used to evaluate the sensitivity derivatives of preselected functions with respect to chosen design variables. In principle, this approach is also very straightforward and quite promising. The main drawback is the large memory requirement because memory use increases linearly with the number of design variables. ADIFOR software can also be cumbersome for large CFD codes and has not yet reached a full maturity level for production codes, especially in parallel computing environments.
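The complex-variable approach named in the title is commonly realized as the complex-step derivative, which avoids the subtractive cancellation of finite differences; the sketch below is a generic scalar illustration, not the Navier-Stokes implementation.

    # Complex-step sensitivity derivative: f'(x) ~ Im(f(x + i*h)) / h
    # (generic illustration, not the CFD code; the test function is arbitrary).
    import cmath

    def f(x):
        return cmath.exp(x) / cmath.sqrt(cmath.sin(x) ** 3 + cmath.cos(x) ** 3)

    x0, h = 1.5, 1e-200
    deriv_cs = f(x0 + 1j * h).imag / h                              # complex step
    deriv_fd = (f(x0 + 1e-8).real - f(x0 - 1e-8).real) / 2e-8        # central difference, for comparison
    print(deriv_cs, deriv_fd)

Because there is no differencing of nearly equal numbers, the step h can be taken extremely small (here 1e-200) without loss of accuracy, sidestepping the step-size difficulty mentioned above.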
Ranhel, João
2012-06-01
Spiking neurons can realize several computational operations when firing cooperatively. This is a prevalent notion, although the mechanisms are not yet understood. A way by which neural assemblies compute is proposed in this paper. It is shown how neural coalitions represent things (and world states), memorize them, and control their hierarchical relations in order to perform algorithms. It is described how neural groups perform statistic logic functions as they form assemblies. Neural coalitions can reverberate, becoming bistable loops. Such bistable neural assemblies become short- or long-term memories that represent the event that triggers them. In addition, assemblies can branch and dismantle other neural groups generating new events that trigger other coalitions. Hence, such capabilities and the interaction among assemblies allow neural networks to create and control hierarchical cascades of causal activities, giving rise to parallel algorithms. Computing and algorithms are used here as in a nonstandard computation approach. In this sense, neural assembly computing (NAC) can be seen as a new class of spiking neural network machines. NAC can explain the following points: 1) how neuron groups represent things and states; 2) how they retain binary states in memories that do not require any plasticity mechanism; and 3) how branching, disbanding, and interaction among assemblies may result in algorithms and behavioral responses. Simulations were carried out and the results are in agreement with the hypothesis presented. A MATLAB code is available as a supplementary material.
The effects of mental representation on performance in a navigation task
NASA Technical Reports Server (NTRS)
Barshi, Immanuel; Healy, Alice F.
2002-01-01
In three experiments, we investigated the mental representations employed when instructions were followed that involved navigation in a space displayed as a grid on a computer screen. Performance was affected much more by the number of instructional units than by the number of words per unit. Performance in a three-dimensional space was independent of the number of dimensions along which participants navigated. However, memory for and accuracy in following the instructions were reduced when the task required mentally representing a three-dimensional space, as compared with representing a two-dimensional space, although the words used in the instructions were identical in the two cases. These results demonstrate the interdependence of verbal and spatial memory representations, because individuals' immediate memory for verbal navigation instructions is affected by their mental representation of the space referred to by the instructions.
Choi, Koeun; Kirkorian, Heather L; Pempek, Tiffany A
2017-04-17
Researchers tested the impact of contextual mismatch, proactive interference, and working memory (WM) on toddlers' transfer across contexts. Forty-two toddlers (27-34 months) completed four object-retrieval trials, requiring memory updating on Trials 2-4. Participants watched hiding events on a tablet computer. Search performance was tested using another tablet (match) or a felt board (mismatch). WM was assessed. On earlier search trials, WM predicted transfer in both conditions, and toddlers in the match condition outperformed those in the mismatch condition; however, the benefit of contextual match and WM decreased over trials. Contextual match apparently increased proactive interference on later trials. Findings are interpreted within existing accounts of the transfer deficit, and a combined account is proposed. © 2017 The Authors. Child Development © 2017 Society for Research in Child Development, Inc.
Space-Bounded Church-Turing Thesis and Computational Tractability of Closed Systems.
Braverman, Mark; Schneider, Jonathan; Rojas, Cristóbal
2015-08-28
We report a new limitation on the ability of physical systems to perform computation-one that is based on generalizing the notion of memory, or storage space, available to the system to perform the computation. Roughly, we define memory as the maximal amount of information that the evolving system can carry from one instant to the next. We show that memory is a limiting factor in computation even in lieu of any time limitations on the evolving system-such as when considering its equilibrium regime. We call this limitation the space-bounded Church-Turing thesis (SBCT). The SBCT is supported by a simulation assertion (SA), which states that predicting the long-term behavior of bounded-memory systems is computationally tractable. In particular, one corollary of SA is an explicit bound on the computational hardness of the long-term behavior of a discrete-time finite-dimensional dynamical system that is affected by noise. We prove such a bound explicitly.
Passive TCP Reconstruction and Forensic Analysis with tcpflow
2013-09-01
The original tcpflow was developed at a time when computers had relatively small memories and, as a result...
WinHPC System Policies | High-Performance Computing | NREL
...requiring high CPU utilization or large amounts of memory should be run on the worker nodes. WinHPC02 is not ... Associated data are removed when NREL worker status is discontinued. Users should make arrangements to save ... Licenses are returned to the license pool when other users close the application or after ...
Experimental test of Landauer’s principle in single-bit operations on nanomagnetic memory bits
Hong, Jeongmin; Lambson, Brian; Dhuey, Scott; Bokor, Jeffrey
2016-01-01
Minimizing energy dissipation has emerged as the key challenge in continuing to scale the performance of digital computers. The question of whether there exists a fundamental lower limit to the energy required for digital operations is therefore of great interest. A well-known theoretical result put forward by Landauer states that any irreversible single-bit operation on a physical memory element in contact with a heat bath at a temperature T requires at least k_B T ln(2) of heat be dissipated from the memory into the environment, where k_B is the Boltzmann constant. We report an experimental investigation of the intrinsic energy loss of an adiabatic single-bit reset operation using nanoscale magnetic memory bits, by far the most ubiquitous digital storage technology in use today. Through sensitive, high-precision magnetometry measurements, we observed that the amount of dissipated energy in this process is consistent (within 2 SDs of experimental uncertainty) with the Landauer limit. This result reinforces the connection between “information thermodynamics” and physical systems and also provides a foundation for the development of practical information processing technologies that approach the fundamental limit of energy dissipation. The significance of the result includes insightful direction for future development of information technology. PMID:26998519
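For scale, the Landauer bound at room temperature evaluates to a few zeptojoules; the quick computation below uses the SI value of the Boltzmann constant.

    # Landauer bound k_B * T * ln(2) at T = 300 K
    import math
    k_B = 1.380649e-23                           # J/K (exact SI value)
    E = k_B * 300.0 * math.log(2)
    print(E, "J ~=", E / 1.602176634e-19, "eV")  # ~2.9e-21 J, i.e. about 0.018 eV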
NASA Astrophysics Data System (ADS)
Mohan, C.
In this paper, I survey briefly some of the recent and emerging trends in hardware and software features which impact high performance transaction processing and data analytics applications. These features include multicore processor chips, ultra large main memories, flash storage, storage class memories, database appliances, field programmable gate arrays, transactional memory, key-value stores, and cloud computing. While some applications, e.g., Web 2.0 ones, were initially built without traditional transaction processing functionality in mind, slowly system architects and designers are beginning to address such previously ignored issues. The availability, analytics and response time requirements of these applications were initially given more importance than ACID transaction semantics and resource consumption characteristics. A project at IBM Almaden is studying the implications of phase change memory on transaction processing, in the context of a key-value store. Bitemporal data management has also become an important requirement, especially for financial applications. Power consumption and heat dissipation properties are also major considerations in the emergence of modern software and hardware architectural features. Considerations relating to ease of configuration, installation, maintenance and monitoring, and improvement of total cost of ownership have resulted in database appliances becoming very popular. The MapReduce paradigm is now quite popular for large scale data analysis, in spite of the major inefficiencies associated with it.
Evaluation of personal digital assistant drug information databases for the managed care pharmacist.
Lowry, Colleen M; Kostka-Rokosz, Maria D; McCloskey, William W
2003-01-01
Personal digital assistants (PDAs) are becoming a necessity for practicing pharmacists. They offer a time-saving and convenient way to obtain current drug information. Several software companies now offer general drug information databases for use on hand held computers. PDAs priced less than 200 US dollars often have limited memory capacity; therefore, the user must choose from a growing list of general drug information database options in order to maximize utility without exceeding memory capacity. This paper reviews the attributes of available general drug information software databases for the PDA. It provides information on the content, advantages, limitations, pricing, memory requirements, and accessibility of drug information software databases. Ten drug information databases were subjectively analyzed and evaluated based on information from the product's Web site, vendor Web sites, and from our experience. Some of these databases have attractive auxiliary features such as kinetics calculators, disease references, drug-drug and drug-herb interaction tools, and clinical guidelines, which may make them more useful to the PDA user. Not all drug information databases are equal with regard to content, author credentials, frequency of updates, and memory requirements. The user must therefore evaluate databases for completeness, currency, and cost effectiveness before purchase. In addition, consideration should be given to the ease of use and flexibility of individual programs.
Holographic memory system based on projection recording of computer-generated 1D Fourier holograms.
Betin, A Yu; Bobrinev, V I; Donchenko, S S; Odinokov, S B; Evtikhiev, N N; Starikov, R S; Starikov, S N; Zlokazov, E Yu
2014-10-01
Utilization of computer generation of holographic structures significantly simplifies the optical scheme that is used to record the microholograms in a holographic memory record system. Digital holographic synthesis also allows the nonlinear errors of the recording system to be accounted for, improving microhologram quality. Multiplexed recording of holograms is a widespread technique for increasing data recording density. In this article we present a holographic memory system based on digital synthesis of amplitude one-dimensional (1D) Fourier transform holograms and multiplexed recording of these holograms onto the holographic carrier using an optical projection scheme. 1D Fourier transform holograms are very sensitive to the orientation of the anamorphic optical element (cylindrical lens) that is required for encoded data object reconstruction. Multiplexed recording of several holograms with different orientations in an optical projection scheme allowed reconstruction of the data object from each hologram by rotating the cylindrical lens to the corresponding angle. We also discuss two optical schemes for reading out the recorded holograms: a full-page readout system and a line-by-line readout system. We consider the benefits of both systems and present results of experimental modeling of non-multiplexed and multiplexed recording and reconstruction of 1D Fourier holograms.
Efficient Algorithms for Estimating the Absorption Spectrum within Linear Response TDDFT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brabec, Jiri; Lin, Lin; Shao, Meiyue
We present a special symmetric Lanczos algorithm and a kernel polynomial method (KPM) for approximating the absorption spectrum of molecules within the linear response time-dependent density functional theory (TDDFT) framework in the product form. In contrast to existing algorithms, the new algorithms are based on reformulating the original non-Hermitian eigenvalue problem as a product eigenvalue problem and the observation that the product eigenvalue problem is self-adjoint with respect to an appropriately chosen inner product. This allows a simple symmetric Lanczos algorithm to be used to compute the desired absorption spectrum. The use of a symmetric Lanczos algorithm only requires half of the memory compared with the nonsymmetric variant of the Lanczos algorithm. The symmetric Lanczos algorithm is also numerically more stable than the nonsymmetric version. The KPM algorithm is also presented as a low-memory alternative to the Lanczos approach, but the algorithm may require more matrix-vector multiplications in practice. We discuss the pros and cons of these methods in terms of their accuracy as well as their computational and storage cost. Applications to a set of small and medium-sized molecules are also presented.
Efficient Algorithms for Estimating the Absorption Spectrum within Linear Response TDDFT
Brabec, Jiri; Lin, Lin; Shao, Meiyue; ...
2015-10-06
We present a special symmetric Lanczos algorithm and a kernel polynomial method (KPM) for approximating the absorption spectrum of molecules within the linear response time-dependent density functional theory (TDDFT) framework in the product form. In contrast to existing algorithms, the new algorithms are based on reformulating the original non-Hermitian eigenvalue problem as a product eigenvalue problem and the observation that the product eigenvalue problem is self-adjoint with respect to an appropriately chosen inner product. This allows a simple symmetric Lanczos algorithm to be used to compute the desired absorption spectrum. The use of a symmetric Lanczos algorithm only requires half of the memory compared with the nonsymmetric variant of the Lanczos algorithm. The symmetric Lanczos algorithm is also numerically more stable than the nonsymmetric version. The KPM algorithm is also presented as a low-memory alternative to the Lanczos approach, but the algorithm may require more matrix-vector multiplications in practice. We discuss the pros and cons of these methods in terms of their accuracy as well as their computational and storage cost. Applications to a set of small and medium-sized molecules are also presented.
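For readers unfamiliar with the building block, the sketch below implements the plain symmetric Lanczos recurrence on a generic dense symmetric matrix and compares the extremal Ritz values with the true eigenvalues; it is not the TDDFT product-form operator or the KPM variant discussed in the paper.

    # Plain symmetric Lanczos recurrence producing a k x k tridiagonal matrix whose
    # eigenvalues approximate those of a large symmetric operator A (generic sketch).
    import numpy as np

    def lanczos(A, v0, k):
        n = len(v0)
        Q = np.zeros((n, k)); alpha = np.zeros(k); beta = np.zeros(k - 1)
        q, q_prev = v0 / np.linalg.norm(v0), np.zeros(n)
        for j in range(k):
            Q[:, j] = q
            w = A @ q
            alpha[j] = q @ w
            w -= alpha[j] * q + (beta[j - 1] * q_prev if j > 0 else 0)
            if j < k - 1:
                beta[j] = np.linalg.norm(w)       # breakdown (beta == 0) ignored in this sketch
                q_prev, q = q, w / beta[j]
        T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        return T, Q

    rng = np.random.default_rng(4)
    M = rng.standard_normal((500, 500)); A = (M + M.T) / 2
    T, _ = lanczos(A, rng.standard_normal(500), 60)
    # extremal Ritz values converge quickly toward A's extremal eigenvalues
    print(np.linalg.eigvalsh(T)[[0, -1]], np.linalg.eigvalsh(A)[[0, -1]])

Only the two most recent Lanczos vectors are strictly needed per step, which is the source of the memory advantage mentioned in the abstract.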
Introducing memory and association mechanism into a biologically inspired visual model.
Qiao, Hong; Li, Yinlin; Tang, Tang; Wang, Peng
2014-09-01
A famous biologically inspired hierarchical model (HMAX model), which was proposed recently and corresponds to V1 to V4 of the ventral pathway in primate visual cortex, has been successfully applied to multiple visual recognition tasks. The model is able to achieve position- and scale-tolerant recognition, which is a central problem in pattern recognition. In this paper, based on additional biological experimental evidence, we introduce a memory and association mechanism into the HMAX model. The main contributions of the work are: 1) mimicking the active memory and association mechanism and adding top-down adjustment to the HMAX model, which is the first attempt to add active adjustment to this well-known model; and 2) from an information-processing perspective, algorithms based on the new model can reduce computation and storage while maintaining good recognition performance. The new model is also applied to object recognition processes. The primary experimental results show that our method is efficient with a much lower memory requirement.
NASA Astrophysics Data System (ADS)
Jin, Jeongwan; Slater, Joshua A.; Saglamyurek, Erhan; Sinclair, Neil; George, Mathew; Ricken, Raimund; Oblak, Daniel; Sohler, Wolfgang; Tittel, Wolfgang
2013-08-01
Quantum memories allowing reversible transfer of quantum states between light and matter are central to quantum repeaters, quantum networks and linear optics quantum computing. Significant progress regarding the faithful transfer of quantum information has been reported in recent years. However, none of these demonstrations confirm that the re-emitted photons remain suitable for two-photon interference measurements, such as C-NOT gates and Bell-state measurements, which constitute another key ingredient for all aforementioned applications. Here, using pairs of laser pulses at the single-photon level, we demonstrate two-photon interference and Bell-state measurements after either none, one or both pulses have been reversibly mapped to separate thulium-doped lithium niobate waveguides. As the interference is always near the theoretical maximum, we conclude that our solid-state quantum memories, in addition to faithfully mapping quantum information, also preserve the entire photonic wavefunction. Hence, our memories are generally suitable for future applications of quantum information processing that require two-photon interference.
Jin, Jeongwan; Slater, Joshua A; Saglamyurek, Erhan; Sinclair, Neil; George, Mathew; Ricken, Raimund; Oblak, Daniel; Sohler, Wolfgang; Tittel, Wolfgang
2013-01-01
Quantum memories allowing reversible transfer of quantum states between light and matter are central to quantum repeaters, quantum networks and linear optics quantum computing. Significant progress regarding the faithful transfer of quantum information has been reported in recent years. However, none of these demonstrations confirm that the re-emitted photons remain suitable for two-photon interference measurements, such as C-NOT gates and Bell-state measurements, which constitute another key ingredient for all aforementioned applications. Here, using pairs of laser pulses at the single-photon level, we demonstrate two-photon interference and Bell-state measurements after either none, one or both pulses have been reversibly mapped to separate thulium-doped lithium niobate waveguides. As the interference is always near the theoretical maximum, we conclude that our solid-state quantum memories, in addition to faithfully mapping quantum information, also preserve the entire photonic wavefunction. Hence, our memories are generally suitable for future applications of quantum information processing that require two-photon interference.
NASA Astrophysics Data System (ADS)
Lou, Tak Pui; Ludewigt, Bernhard
2015-09-01
The simulation of the emission of beta-delayed gamma rays following nuclear fission and the calculation of time-dependent energy spectra is a computational challenge. The widely used radiation transport code MCNPX includes a delayed gamma-ray routine that is inefficient and not suitable for simulating complex problems. This paper describes the code "MMAPDNG" (Memory-Mapped Delayed Neutron and Gamma), an optimized delayed gamma module written in C, discusses usage and merits of the code, and presents results. The approach is based on storing required Fission Product Yield (FPY) data, decay data, and delayed particle data in a memory-mapped file. When compared to the original delayed gamma-ray code in MCNPX, memory utilization is reduced by two orders of magnitude and delayed gamma-ray sampling is sped up by three orders of magnitude. Other delayed particles such as neutrons and electrons can be implemented in future versions of the MMAPDNG code using its existing framework.
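The central trick, keeping large tables in a memory-mapped file so that only the pages actually touched are loaded, can be illustrated with numpy.memmap; the table layout below is hypothetical and unrelated to the MMAPDNG file format.

    # Generic memory-mapped lookup table with numpy.memmap (hypothetical layout;
    # the actual MMAPDNG file format is not reproduced here).
    import os, tempfile
    import numpy as np

    path = os.path.join(tempfile.gettempdir(), "demo_table.dat")
    n_nuclides, n_energy = 1000, 4096

    # one-time creation of the table on disk
    table = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_nuclides, n_energy))
    table[:] = np.random.default_rng(5).random((n_nuclides, n_energy), dtype=np.float32)
    table.flush(); del table

    # at run time the file is mapped, not read: only touched pages enter memory
    table = np.memmap(path, dtype=np.float32, mode="r", shape=(n_nuclides, n_energy))
    spectrum = np.asarray(table[42])      # pulls in just the pages backing row 42
    print(spectrum[:5])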
Synthetic analog and digital circuits for cellular computation and memory.
Purcell, Oliver; Lu, Timothy K
2014-10-01
Biological computation is a major area of focus in synthetic biology because it has the potential to enable a wide range of applications. Synthetic biologists have applied engineering concepts to biological systems in order to construct progressively more complex gene circuits capable of processing information in living cells. Here, we review the current state of computational genetic circuits and describe artificial gene circuits that perform digital and analog computation. We then discuss recent progress in designing gene networks that exhibit memory, and how memory and computation have been integrated to yield more complex systems that can both process and record information. Finally, we suggest new directions for engineering biological circuits capable of computation. Copyright © 2014 The Authors. Published by Elsevier Ltd. All rights reserved.
Efficient Memory Access with NumPy Global Arrays using Local Memory Access
DOE Office of Scientific and Technical Information (OSTI.GOV)
Daily, Jeffrey A.; Berghofer, Dan C.
This paper discusses work completed on Global Arrays of data on distributed multi-computer systems and on improving their performance. The tasks were completed at Pacific Northwest National Laboratory in the Science Undergraduate Laboratory Internship program in the summer of 2013 for the Data Intensive Computing Group in the Fundamental and Computational Sciences Directorate. This work was done on the Global Arrays Toolkit developed by this group. This toolkit is an interface for programmers to more easily create arrays of data on networks of computers. This is useful because scientific computation is often done on large amounts of data, sometimes so large that individual computers cannot hold all of it. This data is held in array form and can best be processed on supercomputers, which often consist of a network of individual computers doing their computation in parallel. One major challenge for this sort of programming is that operations on arrays spread across multiple computers are very complex, and an interface is needed so that these arrays seem like they are on a single computer. This is what Global Arrays does. The work done here uses more efficient operations on that data that require less copying. This saves a lot of time because copying data between many different computers is time intensive. The challenge was solved as follows: when the operands of a binary operation reside on the same computer, they are not copied when accessed; when they are on separate computers, only one set is copied when accessed. This saves time through reduced copying, although more data access operations are performed.
Data storage technology comparisons
NASA Technical Reports Server (NTRS)
Katti, Romney R.
1990-01-01
The role of data storage and data storage technology is an integral, though conceptually often underestimated, portion of data processing technology. Data storage is important in the mass storage mode in which generated data is buffered for later use. But data storage technology is also important in the data flow mode when data are manipulated and hence required to flow between databases, datasets and processors. This latter mode is commonly associated with memory hierarchies which support computation. VLSI devices can reasonably be defined as electronic circuit devices such as channel and control electronics as well as highly integrated, solid-state devices that are fabricated using thin film deposition technology. VLSI devices in both capacities play an important role in data storage technology. In addition to random access memories (RAM), read-only memories (ROM), and other silicon-based variations such as PROM's, EPROM's, and EEPROM's, integrated devices find their way into a variety of memory technologies which offer significant performance advantages. These memory technologies include magnetic tape, magnetic disk, magneto-optic disk, and vertical Bloch line memory. In this paper, some comparison between selected technologies will be made to demonstrate why more than one memory technology exists today, based for example on access time and storage density at the active bit and system levels.
Progress In Optical Memory Technology
NASA Astrophysics Data System (ADS)
Tsunoda, Yoshito
1987-01-01
More than 20 years have passed since the concept of optical memory was first proposed in 1966. Since then considerable progress has been made in this area, together with the creation of completely new markets for optical memory in consumer and computer applications. The first generation of optical memory was developed mainly with holographic recording technology in the late 1960s and early 1970s. A considerable number of developments were made in both analog and digital memory applications. Unfortunately, these technologies did not get the chance to become commercial products. The second generation of optical memory started at the beginning of the 1970s with bit-by-bit recording technology. Read-only optical memories such as video disks and compact audio disks have been extensively investigated. Since laser diodes were first applied to optical video disk readout in 1976, there have been extensive developments of laser diode pick-ups for optical disk memory systems. The third generation of optical memory started in 1978 with bit-by-bit read/write technology using laser diodes. Developments of recording materials, including both write-once and erasable, have been actively pursued at several research institutes. These technologies are mainly focused on optical memory systems for computer applications. Such practical applications of optical memory technology have resulted in the creation of new products such as compact audio disks and computer file memories.
An algorithm of discovering signatures from DNA databases on a computer cluster.
Lee, Hsiao Ping; Sheu, Tzu-Fang
2014-10-05
Signatures are short sequences that are unique and not similar to any other sequence in a database; they can be used as the basis to identify different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entire database to be loaded into memory, thus restricting the amount of data that they can process and making them unable to handle databases with large amounts of data. Also, those algorithms use sequential models and have slower discovery speeds, meaning that their efficiency can be improved. In this research, we debut the use of a divide-and-conquer strategy in signature discovery and propose a parallel signature discovery algorithm on a computer cluster. The algorithm applies the divide-and-conquer strategy to solve the problem posed to the existing algorithms, which are unable to process large databases, and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with just the memory of regular personal computers, the algorithm can still process large databases such as the human whole-genome EST database, which were previously unable to be processed by the existing algorithms. The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
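The divide-and-conquer structure (split the database into chunks, discover candidates per chunk in parallel, then merge) can be sketched as below for the simplified case of exact-match unique k-mers; the published algorithm additionally handles mismatch tolerance, and the sequences here are randomly generated stand-ins.

    # Simplified divide-and-conquer discovery of exact-match unique k-mers
    # (the published algorithm handles similarity/mismatches; this only sketches
    # the split -> local count -> merge structure on hypothetical data).
    import random
    from collections import Counter
    from multiprocessing import Pool

    K = 8

    def kmer_counts(chunk_of_sequences):
        c = Counter()
        for seq in chunk_of_sequences:
            for i in range(len(seq) - K + 1):
                c[seq[i:i + K]] += 1
        return c

    if __name__ == "__main__":
        random.seed(6)
        db = ["".join(random.choice("ACGT") for _ in range(300)) for _ in range(200)]
        chunks = [db[i::4] for i in range(4)]            # divide the database

        with Pool(4) as pool:                            # conquer each chunk in parallel
            partial = pool.map(kmer_counts, chunks)

        total = Counter()
        for c in partial:                                # merge the partial counts
            total.update(c)
        signatures = [k for k, n in total.items() if n == 1]   # globally unique k-mers
        print(len(signatures), "candidate signatures")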
On-the-fly Doppler broadening of unresolved resonance region cross sections
Walsh, Jonathan A.; Forget, Benoit; Smith, Kord S.; ...
2017-07-29
In this paper, two methods for computing temperature-dependent unresolved resonance region cross sections on-the-fly within continuous-energy Monte Carlo neutron transport simulations are presented. The first method calculates Doppler broadened cross sections directly from zero-temperature average resonance parameters. In a simulation, at each event that requires cross section values, a realization of unresolved resonance parameters is generated about the desired energy and temperature-dependent single-level Breit-Wigner resonance cross sections are computed directly via the analytical Ψ-x Doppler integrals. The second method relies on the generation of equiprobable cross section magnitude bands on an energy-temperature mesh. Within a simulation, the bands are sampled and interpolated in energy and temperature to obtain cross section values on-the-fly. Both of the methods, as well as their underlying calculation procedures, are verified numerically in extensive code-to-code comparisons. Energy-dependent pointwise cross sections calculated with the newly-implemented procedures are shown to be in excellent agreement with those calculated by a widely-used nuclear data processing code. Relative differences at or below 0.1% are observed. Integral criticality benchmark results computed with the proposed methods are shown to reproduce those computed with a state-of-the-art processed nuclear data library very well. In simulations of fast spectrum systems which are highly-sensitive to the representation of cross section data in the unresolved region, k-eigenvalue and neutron flux spectra differences of <10 pcm and <1.0% are observed, respectively. The direct method is demonstrated to be well-suited to the calculation of reference solutions — against which results obtained with a discretized representation may be assessed — as a result of its treatment of the energy, temperature, and cross section magnitude variables as continuous. Also, because there is no pre-processed data to store (only temperature-independent average resonance parameters) the direct method is very memory-efficient. Typically, only a few kB of memory are needed to store all required unresolved region data for a single nuclide. However, depending on the details of a particular simulation, performing URR cross section calculations on-the-fly can significantly increase simulation times. Alternatively, the method of interpolating equiprobable probability bands is demonstrated to produce results that are as accurate as the direct reference solutions, to within arbitrary precision, with high computational efficiency in terms of memory requirements and simulation time. Analyses of a fast spectrum system show that interpolation on a coarse energy-temperature mesh can be used to reproduce reference k-eigenvalue results obtained with cross sections calculated continuously in energy and directly at an exact temperature to within <10 pcm. Probability band data on a mesh encompassing the range of temperatures relevant to reactor analysis usually require around 100 kB of memory per nuclide. Finally, relative to the case in which probability table data generated at a single, desired temperature are used, minor increases in simulation times are observed when probability band interpolation is employed.
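The second method amounts to sampling one of the equiprobable bands and interpolating its magnitude on the energy-temperature mesh; the sketch below shows that bilinear interpolation step with a made-up band table (grid values, band count, and the interior-point query are hypothetical, and no bounds checking is done).

    # Sketch of sampling an equiprobable cross-section band and interpolating it in
    # energy and temperature (hypothetical band table; not the production data).
    import numpy as np

    rng = np.random.default_rng(7)
    energies = np.array([20e3, 40e3, 80e3, 160e3])       # eV grid in the URR (hypothetical)
    temps = np.array([300.0, 600.0, 900.0, 1200.0])      # K
    n_bands = 16
    # bands[iE, iT, b] = cross-section magnitude of band b (barns), monotone in b
    bands = np.sort(rng.random((len(energies), len(temps), n_bands)) * 10.0, axis=-1)

    def sample_xs(E, T):
        b = rng.integers(n_bands)                        # bands are equiprobable
        iE = np.searchsorted(energies, E) - 1
        iT = np.searchsorted(temps, T) - 1
        fE = (E - energies[iE]) / (energies[iE + 1] - energies[iE])
        fT = (T - temps[iT]) / (temps[iT + 1] - temps[iT])
        c00, c01 = bands[iE, iT, b], bands[iE, iT + 1, b]
        c10, c11 = bands[iE + 1, iT, b], bands[iE + 1, iT + 1, b]
        return (1-fE)*(1-fT)*c00 + (1-fE)*fT*c01 + fE*(1-fT)*c10 + fE*fT*c11

    print(sample_xs(50e3, 750.0))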
Flowfield computation of entry vehicles
NASA Technical Reports Server (NTRS)
Prabhu, Dinesh K.
1990-01-01
The equations governing the multidimensional flow of a reacting mixture of thermally perfect gases were derived. The modeling procedures for the various terms of the conservation laws are discussed. A numerical algorithm, based on the finite-volume approach, to solve these conservation equations was developed. The advantages and disadvantages of the present numerical scheme are discussed from the point of view of accuracy, computer time, and memory requirements. A simple one-dimensional model problem was solved to prove the feasibility and accuracy of the algorithm. A computer code implementing the above algorithm was developed and is presently being applied to simple geometries and conditions. Once the code is completely debugged and validated, it will be used to compute the complete unsteady flow field around the Aeroassist Flight Experiment (AFE) body.
Causal learning with local computations.
Fernbach, Philip M; Sloman, Steven A
2009-05-01
The authors proposed and tested a psychological theory of causal structure learning based on local computations. Local computations simplify complex learning problems via cues available on individual trials to update a single causal structure hypothesis. Structural inferences from local computations make minimal demands on memory, require relatively small amounts of data, and need not respect normative prescriptions as inferences that are principled locally may violate those principles when combined. Over a series of 3 experiments, the authors found (a) systematic inferences from small amounts of data; (b) systematic inference of extraneous causal links; (c) influence of data presentation order on inferences; and (d) error reduction through pretraining. Without pretraining, a model based on local computations fitted data better than a Bayesian structural inference model. The data suggest that local computations serve as a heuristic for learning causal structure. Copyright 2009 APA, all rights reserved.
Performance and scalability evaluation of "Big Memory" on Blue Gene Linux.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoshii, K.; Iskra, K.; Naik, H.
2011-05-01
We address memory performance issues observed in Blue Gene Linux and discuss the design and implementation of 'Big Memory' - an alternative, transparent memory space introduced to eliminate the memory performance issues. We evaluate the performance of Big Memory using custom memory benchmarks, NAS Parallel Benchmarks, and the Parallel Ocean Program, at a scale of up to 4,096 nodes. We find that Big Memory successfully resolves the performance issues normally encountered in Blue Gene Linux. For the ocean simulation program, we even find that Linux with Big Memory provides better scalability than does the lightweight compute node kernel designed solely for high-performance applications. Originally intended exclusively for compute node tasks, our new memory subsystem dramatically improves the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray radio telescope as an example.
Low latency and persistent data storage
Fitch, Blake G; Franceschini, Michele M; Jagmohan, Ashish; Takken, Todd
2014-11-04
Persistent data storage is provided by a computer program product that includes computer program code configured for receiving a low latency store command that includes write data. The write data is written to a first memory device that is implemented by a nonvolatile solid-state memory technology characterized by a first access speed. It is acknowledged that the write data has been successfully written to the first memory device. The write data is written to a second memory device that is implemented by a volatile memory technology. At least a portion of the data in the first memory device is written to a third memory device when a predetermined amount of data has been accumulated in the first memory device. The third memory device is implemented by a nonvolatile solid-state memory technology characterized by a second access speed that is slower than the first access speed.
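The write path described in the claim can be paraphrased as the small sketch below, with in-memory lists standing in for the three devices and an illustrative destage threshold; it mirrors the ordering of steps only, not any real storage stack.

    # Sketch of the described tiered write path (hypothetical in-memory stand-ins
    # for the three devices; threshold illustrative).
    class TieredStore:
        def __init__(self, destage_threshold=4):
            self.fast_nvm = []          # stand-in for the fast nonvolatile buffer
            self.dram = []              # stand-in for volatile memory
            self.slow_nvm = []          # stand-in for the slower nonvolatile device
            self.destage_threshold = destage_threshold

        def low_latency_store(self, data):
            self.fast_nvm.append(data)          # 1) persist in fast NVM
            ack = True                          # 2) acknowledge once durable
            self.dram.append(data)              # 3) mirror into volatile memory
            if len(self.fast_nvm) >= self.destage_threshold:
                self.slow_nvm.extend(self.fast_nvm)   # 4) destage accumulated data
                self.fast_nvm.clear()
            return ack

    store = TieredStore()
    for i in range(10):
        store.low_latency_store(f"record-{i}")
    print(len(store.fast_nvm), len(store.dram), len(store.slow_nvm))   # 2 10 8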
A simplified computational memory model from information processing.
Zhang, Lanhua; Zhang, Dongsheng; Deng, Yuqin; Ding, Xiaoqian; Wang, Yan; Tang, Yiyuan; Sun, Baoliang
2016-11-23
This paper proposes a computational model of memory from an information-processing view. The model, called the simplified memory information retrieval network (SMIRN), is a bi-modular hierarchical functional memory network obtained by abstracting memory function and simulating memory information processing. First, a meta-memory is defined to represent neurons or brain cortices based on biology and graph theory; an intra-modular network is then developed with the modeling algorithm by mapping nodes and edges, and the bi-modular network is delineated with intra-modular and inter-modular connections. Finally, a polynomial retrieval algorithm is introduced. We simulate the memory phenomena and the functions of memorization and strengthening with information-processing algorithms. Theoretical analysis and simulation results show that the model is consistent with memory phenomena from an information-processing view.
Repeat-until-success cubic phase gate for universal continuous-variable quantum computation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Marshall, Kevin; Pooser, Raphael; Siopsis, George
2015-03-24
To achieve universal quantum computation using continuous variables, one needs to jump out of the set of Gaussian operations and have a non-Gaussian element, such as the cubic phase gate. However, such a gate is currently very difficult to implement in practice. Here we introduce an experimentally viable “repeat-until-success” approach to generating the cubic phase gate, which is achieved using sequential photon subtractions and Gaussian operations. Ultimately, we find that our scheme offers benefits in terms of the expected time until success, as well as the fact that we do not require any complex off-line resource state, although we require a primitive quantum memory.
A fast collocation method for a variable-coefficient nonlocal diffusion model
NASA Astrophysics Data System (ADS)
Wang, Che; Wang, Hong
2017-02-01
We develop a fast collocation scheme for a variable-coefficient nonlocal diffusion model, for which a numerical discretization would yield a dense stiffness matrix. The development of the fast method is achieved by carefully handling the variable coefficients appearing inside the singular integral operator and exploiting the structure of the dense stiffness matrix. The resulting fast method reduces the computational work from O(N^3) required by a commonly used direct solver to O(N log N) per iteration and the memory requirement from O(N^2) to O(N). Furthermore, the fast method reduces the computational work of assembling the stiffness matrix from O(N^2) to O(N). Numerical results are presented to show the utility of the fast method.
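Although the paper's scheme additionally handles variable coefficients, the standard mechanism behind an O(N log N) matrix-vector product for a translation-invariant kernel is circulant embedding of a Toeplitz matrix plus the FFT; the sketch below is a generic illustration of where the log-linear complexity comes from, not the paper's exact algorithm.

    # Toeplitz matrix-vector product in O(N log N) via circulant embedding and FFT
    # (generic illustration of the complexity argument only).
    import numpy as np
    from scipy.linalg import toeplitz

    def toeplitz_matvec(first_col, first_row, x):
        n = len(x)
        # embed the Toeplitz matrix in a 2n x 2n circulant; its first column is
        # [c_0 ... c_{n-1}, 0, r_{n-1} ... r_1]
        c = np.concatenate([first_col, [0.0], first_row[1:][::-1]])
        y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
        return y[:n].real

    rng = np.random.default_rng(8)
    n = 512
    col, row = rng.random(n), rng.random(n); row[0] = col[0]
    x = rng.random(n)

    # agrees with the dense O(N^2) product, but costs only a few FFTs of length 2N
    assert np.allclose(toeplitz(col, row) @ x, toeplitz_matvec(col, row, x))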
Development and application of GASP 2.0
NASA Technical Reports Server (NTRS)
Mcgrory, W. D.; Huebner, L. D.; Slack, D. C.; Walters, R. W.
1992-01-01
GASP 2.0 represents a major new release of the computational fluid dynamics code in wide use by the aerospace community. The authors have spent the last two years analyzing the strengths and weaknesses of the previous version of the finite-rate chemistry, Navier Stokes solution algorithm. What has resulted is a completely redesigned computer code that offers two to four times the performance of previous versions while requiring as little as one quarter of the memory requirements. In addition to the improvements in efficiency over the original code, Version 2.0 contains many new features. A brief discussion of the improvements made to GASP, and an application using GASP 2.0 which demonstrates some of the new features are presented.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Y. M., E-mail: ymingy@gmail.com; Bednarz, B.; Svatos, M.
Purpose: The future of radiation therapy will require advanced inverse planning solutions to support single-arc, multiple-arc, and “4π” delivery modes, which present unique challenges in finding an optimal treatment plan over a vast search space, while still preserving dosimetric accuracy. The successful clinical implementation of such methods would benefit from Monte Carlo (MC) based dose calculation methods, which can offer improvements in dosimetric accuracy when compared to deterministic methods. The standard method for MC based treatment planning optimization leverages the accuracy of the MC dose calculation and efficiency of well-developed optimization methods, by precalculating the fluence to dose relationship within a patient with MC methods and subsequently optimizing the fluence weights. However, the sequential nature of this implementation is computationally time consuming and memory intensive. Methods to reduce the overhead of the MC precalculation have been explored in the past, demonstrating promising reductions of computational time overhead, but with limited impact on the memory overhead due to the sequential nature of the dose calculation and fluence optimization. The authors propose an entirely new form of “concurrent” Monte Carlo treatment plan optimization: a platform which optimizes the fluence during the dose calculation, reduces wasted computation time being spent on beamlets that weakly contribute to the final dose distribution, and requires only a low memory footprint to function. In this initial investigation, the authors explore the key theoretical and practical considerations of optimizing fluence in such a manner. Methods: The authors present a novel derivation and implementation of a gradient descent algorithm that allows for optimization during MC particle transport, based on highly stochastic information generated through particle transport of very few histories. A gradient rescaling and renormalization algorithm, and the concept of momentum from stochastic gradient descent were used to address obstacles unique to performing gradient descent fluence optimization during MC particle transport. The authors have applied their method to two simple geometrical phantoms, and one clinical patient geometry to examine the capability of this platform to generate conformal plans as well as assess its computational scaling and efficiency, respectively. Results: The authors obtain a reduction of at least 50% in total histories transported in their investigation compared to a theoretical unweighted beamlet calculation and subsequent fluence optimization method, and observe a roughly fixed optimization time overhead consisting of ∼10% of the total computation time in all cases. Finally, the authors demonstrate a negligible increase in memory overhead of ∼7–8 MB to allow for optimization of a clinical patient geometry surrounded by 36 beams using their platform. Conclusions: This study demonstrates a fluence optimization approach, which could significantly improve the development of next generation radiation therapy solutions while incurring minimal additional computational overhead.
Svatos, M.; Zankowski, C.; Bednarz, B.
2016-01-01
Purpose: The future of radiation therapy will require advanced inverse planning solutions to support single-arc, multiple-arc, and “4π” delivery modes, which present unique challenges in finding an optimal treatment plan over a vast search space, while still preserving dosimetric accuracy. The successful clinical implementation of such methods would benefit from Monte Carlo (MC) based dose calculation methods, which can offer improvements in dosimetric accuracy when compared to deterministic methods. The standard method for MC based treatment planning optimization leverages the accuracy of the MC dose calculation and efficiency of well-developed optimization methods, by precalculating the fluence to dose relationship within a patient with MC methods and subsequently optimizing the fluence weights. However, the sequential nature of this implementation is computationally time consuming and memory intensive. Methods to reduce the overhead of the MC precalculation have been explored in the past, demonstrating promising reductions of computational time overhead, but with limited impact on the memory overhead due to the sequential nature of the dose calculation and fluence optimization. The authors propose an entirely new form of “concurrent” Monte Carlo treatment plan optimization: a platform which optimizes the fluence during the dose calculation, reduces wasted computation time being spent on beamlets that weakly contribute to the final dose distribution, and requires only a low memory footprint to function. In this initial investigation, the authors explore the key theoretical and practical considerations of optimizing fluence in such a manner. Methods: The authors present a novel derivation and implementation of a gradient descent algorithm that allows for optimization during MC particle transport, based on highly stochastic information generated through particle transport of very few histories. A gradient rescaling and renormalization algorithm, and the concept of momentum from stochastic gradient descent were used to address obstacles unique to performing gradient descent fluence optimization during MC particle transport. The authors have applied their method to two simple geometrical phantoms, and one clinical patient geometry to examine the capability of this platform to generate conformal plans as well as assess its computational scaling and efficiency, respectively. Results: The authors obtain a reduction of at least 50% in total histories transported in their investigation compared to a theoretical unweighted beamlet calculation and subsequent fluence optimization method, and observe a roughly fixed optimization time overhead consisting of ∼10% of the total computation time in all cases. Finally, the authors demonstrate a negligible increase in memory overhead of ∼7–8 MB to allow for optimization of a clinical patient geometry surrounded by 36 beams using their platform. Conclusions: This study demonstrates a fluence optimization approach, which could significantly improve the development of next generation radiation therapy solutions while incurring minimal additional computational overhead. PMID:27277051
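As a rough illustration of the update rule described above, the following sketch applies a rescaled, momentum-smoothed gradient step to a vector of beamlet fluence weights between short Monte Carlo transport batches. The function and variable names (fluence_step, noisy_grad) are assumptions made for this example, and the rescaling shown is a simple norm renormalization, not the authors' specific algorithm.

import numpy as np

def fluence_step(weights, noisy_grad, velocity, lr=0.05, beta=0.9):
    # Renormalize the very noisy gradient estimated from a handful of
    # transported histories so its magnitude does not swamp the update.
    norm = np.linalg.norm(noisy_grad)
    if norm > 0.0:
        noisy_grad = noisy_grad / norm
    # Momentum term borrowed from stochastic gradient descent.
    velocity = beta * velocity + (1.0 - beta) * noisy_grad
    # Fluence weights must stay non-negative.
    weights = np.clip(weights - lr * velocity, 0.0, None)
    return weights, velocity

# Illustrative use: interleave updates with short transport batches.
weights = np.ones(36)                 # one weight per beam/beamlet (hypothetical)
velocity = np.zeros_like(weights)
for batch in range(100):
    noisy_grad = np.random.randn(36)  # stand-in for the stochastic MC gradient estimate
    weights, velocity = fluence_step(weights, noisy_grad, velocity)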
MUTILS - a set of efficient modeling tools for multi-core CPUs implemented in MEX
NASA Astrophysics Data System (ADS)
Krotkiewski, Marcin; Dabrowski, Marcin
2013-04-01
The need for computational performance is common in scientific applications, and in particular in numerical simulations, where high resolution models require efficient processing of large amounts of data. Especially in the context of geological problems, the need to increase the model resolution to resolve physical and geometrical complexities seems to have no limits. Alas, the performance of new generations of CPUs no longer improves simply by increasing clock speeds. Current industrial trends are to increase the number of computational cores. As a result, parallel implementations are required in order to fully utilize the potential of new processors, and to study more complex models. We target simulations on small to medium scale shared memory computers: from laptops and desktop PCs with ~8 CPU cores and up to tens of GB of memory to high-end servers with ~50 CPU cores and hundreds of GB of memory. In this setting, MATLAB is often the environment of choice for scientists who want to implement their own models with little effort. It is a useful general purpose mathematical software package, but due to its versatility some of its functionality is not as efficient as it could be. In particular, the challenges of modern multi-core architectures are not fully addressed. We have developed MILAMIN 2 - an efficient FEM modeling environment written in native MATLAB. Among other things, MILAMIN provides functions to define model geometry, generate and convert structured and unstructured meshes (also through interfaces to external mesh generators), compute element and system matrices, apply boundary conditions, solve the system of linear equations, address non-linear and transient problems, and perform post-processing. MILAMIN strives to combine ease of code development with computational efficiency. Where possible, the code is optimized and/or parallelized within the MATLAB framework. Native MATLAB is augmented with the MUTILS library - a set of MEX functions that implement the computationally intensive, performance-critical parts of the code, which we have identified as bottlenecks. Here, we discuss the functionality and performance of the MUTILS library. Currently, it includes: 1. time and memory efficient assembly of sparse matrices for FEM simulations; 2. a parallel sparse matrix-vector product with optimizations specific to symmetric matrices and multiple degrees of freedom per node; 3. parallel point-in-triangle and point-in-tetrahedron location for unstructured, adaptive 2D and 3D meshes (useful for 'marker in cell' type methods); 4. parallel FEM interpolation for 2D and 3D meshes of elements of different types and orders, and for different numbers of degrees of freedom per node; 5. a stand-alone MEX implementation of the Conjugate Gradients iterative solver; 6. an interface to METIS graph partitioning and a fast implementation of RCM reordering.
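To indicate what one of these performance-critical kernels does (this is not the MUTILS MEX code itself), here is a plain-Python sketch of a sparse matrix-vector product over compressed sparse row (CSR) arrays; a C/OpenMP MEX implementation would parallelize the row loop and add the symmetry and multiple-degrees-of-freedom optimizations, which are omitted here.

import numpy as np

def csr_spmv(indptr, indices, data, x):
    # y = A @ x for a matrix stored in compressed sparse row form.
    # In a MEX/C version the outer loop over rows is the natural place
    # to split work across threads, since rows are independent.
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(data, x))
    for row in range(n_rows):
        lo, hi = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[lo:hi], x[indices[lo:hi]])
    return y

# 2x2 example: A = [[4, 1], [0, 3]]
indptr = np.array([0, 2, 3])
indices = np.array([0, 1, 1])
data = np.array([4.0, 1.0, 3.0])
print(csr_spmv(indptr, indices, data, np.array([1.0, 2.0])))  # [6. 6.]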
2015-04-01
... report is to examine how a computer forensic investigator/incident handler, without specialised computer memory or software reverse engineering skills ... The skills amassed by incident handlers and investigators alike while using Volatility to examine Windows memory images will be of some help ...
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.
1992-01-01
An efficient three-dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between the two differing architectures are made.
NASA Astrophysics Data System (ADS)
Sourbier, F.; Operto, S.; Virieux, J.
2006-12-01
We present a distributed-memory parallel algorithm for 2D visco-acoustic full-waveform inversion of wide-angle seismic data. Our code is written in Fortran 90 and uses MPI for parallelism. The algorithm was applied to a real wide-angle data set recorded by 100 OBSs with a 1-km spacing in the eastern Nankai trough (Japan) to image the deep structure of the subduction zone. Full-waveform inversion is applied sequentially to discrete frequencies by proceeding from the low to the high frequencies. The inverse problem is solved with a classic gradient method. Full-waveform modeling is performed with a frequency-domain finite-difference method. In the frequency domain, solving the wave equation requires the resolution of a large unsymmetric system of linear equations. We use the massively parallel direct solver MUMPS (http://www.enseeiht.fr/irit/apo/MUMPS) for distributed-memory computers to solve this system. The MUMPS solver is based on a multifrontal method for the parallel factorization. The MUMPS algorithm is subdivided into three main steps: first, a symbolic analysis step that performs re-ordering of the matrix coefficients to minimize the fill-in of the matrix during the subsequent factorization and an estimation of the assembly tree of the matrix. Second, the factorization is performed with dynamic scheduling to accommodate numerical pivoting and provides the LU factors distributed over all the processors. Third, the resolution is performed for multiple sources. To compute the gradient of the cost function, two simulations per shot are required (one to compute the forward wavefield and one to back-propagate residuals). The multi-source resolutions can be performed in parallel with MUMPS. In the end, each processor stores in core a sub-domain of all the solutions. These distributed solutions can be exploited to compute the gradient of the cost function in parallel. Since the gradient of the cost function is a weighted stack of the shot and residual solutions of MUMPS, each processor computes the corresponding sub-domain of the gradient. In the end, the gradient is centralized on the master processor using a collective communication. The gradient is scaled by the diagonal elements of the Hessian matrix. This scaling is computed only once per frequency before the first iteration of the inversion. Estimation of the diagonal terms of the Hessian requires performing one simulation per non-redundant shot and receiver position. The same strategy as the one used for the gradient is used to compute the diagonal Hessian in parallel. This algorithm was applied to a dense wide-angle data set recorded by 100 OBSs in the eastern Nankai trough, offshore Japan. Thirteen frequencies ranging from 3 to 15 Hz were inverted. Twenty iterations per frequency were computed, leading to 260 tomographic velocity models of increasing resolution. The velocity model dimensions are 105 km x 25 km, corresponding to a finite-difference grid of 4201 x 1001 nodes with a 25-m grid interval. The number of shots was 1005 and the number of inverted OBS gathers was 93. The inversion requires 20 days on six 32-bit bi-processor nodes with 4 Gbytes of RAM per node when only the LU factorization is performed in parallel. Preliminary estimates of the time required to perform the inversion with the fully parallelized code are 6 and 4 days using 20 and 50 processors, respectively.
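The "weighted stack" of forward and back-propagated wavefields mentioned above can be sketched as follows. This is a generic frequency-domain gradient formula under assumed conventions (choice of model parameter, sign, and omega-squared weighting), not the authors' Fortran/MPI code; in their setup each MPI rank would apply it only to the sub-domain of the MUMPS solutions it stores, before a collective reduction onto the master processor.

import numpy as np

def fwi_gradient(omega, forward_fields, backprop_fields):
    # forward_fields[s], backprop_fields[s]: complex monochromatic wavefields
    # for shot s (forward simulation and back-propagated data residuals),
    # flattened over the finite-difference grid.
    grad = np.zeros(forward_fields.shape[1])
    for u_fwd, u_back in zip(forward_fields, backprop_fields):
        # Zero-lag correlation of the two wavefields, weighted by omega^2.
        grad += np.real(omega ** 2 * u_fwd * np.conj(u_back))
    return grad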
Hypercluster Parallel Processor
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela
1992-01-01
The Hypercluster computer system includes multiple digital processors, the operation of which is coordinated through specialized software. It is configurable according to various parallel-computing architectures of the shared-memory or distributed-memory class, including scalar, vector, reduced-instruction-set, and complex-instruction-set computers. It is designed as a flexible, relatively inexpensive system that provides a single programming and operating environment within which one can investigate the effects of various parallel-computing architectures and their combinations on performance in the solution of complicated problems, such as three-dimensional flows in turbomachines. The Hypercluster software and architectural concepts are in the public domain.
GPU-accelerated adjoint algorithmic differentiation
NASA Astrophysics Data System (ADS)
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
GPU-Accelerated Adjoint Algorithmic Differentiation.
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
GPU-Accelerated Adjoint Algorithmic Differentiation
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2015-01-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the “tape”. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography. PMID:26941443
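To make the idea of a tape and of coarse-grained vector intrinsics concrete, the toy reverse-mode sketch below records a whole dot product as a single tape entry instead of many scalar entries. It illustrates the concept only; it is not the authors' AAD software, and the class and function names (Tape, Var, dot, backpropagate) are invented for this example.

import numpy as np

class Tape:
    # One recorded entry (a backward closure) per vector intrinsic,
    # rather than one entry per scalar operation.
    def __init__(self):
        self.ops = []

class Var:
    def __init__(self, value, tape):
        self.value = np.asarray(value, dtype=float)
        self.adj = np.zeros_like(self.value)
        self.tape = tape

def dot(a, b):
    # Forward evaluation of the intrinsic plus its recorded adjoint rule.
    out = Var(a.value @ b.value, a.tape)
    def backward():
        a.adj += out.adj * b.value
        b.adj += out.adj * a.value
    a.tape.ops.append(backward)
    return out

def backpropagate(tape, output):
    # Seed the output adjoint and replay the tape in reverse order.
    output.adj = np.ones_like(output.value)
    for backward in reversed(tape.ops):
        backward()

tape = Tape()
w = Var([1.0, 2.0, 3.0], tape)
x = Var([0.5, -0.5, 2.0], tape)
y = dot(w, x)            # y = w . x
backpropagate(tape, y)   # afterwards w.adj == x.value and x.adj == w.value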
NASA Astrophysics Data System (ADS)
Yang, L. M.; Shu, C.; Yang, W. M.; Wu, J.
2018-04-01
High memory consumption and computational effort are the major barriers preventing the widespread use of the discrete velocity method (DVM) in the simulation of flows in all flow regimes. To overcome this drawback, an implicit DVM with a memory reduction technique for solving a steady discrete velocity Boltzmann equation (DVBE) is presented in this work. In the method, the distribution functions in the whole discrete velocity space do not need to be stored; they are calculated from the macroscopic flow variables. As a result, its memory requirement is of the same order as that of a conventional Euler/Navier-Stokes solver. At the same time, it is more efficient than the explicit DVM for the simulation of various flows. To make the method efficient for solving flow problems in all flow regimes, a prediction step is introduced to estimate the local equilibrium state of the DVBE. In the prediction step, the distribution function at the cell interface is calculated from the local solution of the DVBE. When the cell size is less than the mean free path, the prediction step has almost no effect on the solution. However, when the cell size is much larger than the mean free path, the prediction step dominates the solution so as to provide reasonable results in such a flow regime. In addition, to further improve the computational efficiency of the developed scheme in the continuum flow regime, the implicit technique is also introduced into the prediction step. Numerical results show that the proposed implicit scheme provides reasonable results in all flow regimes and significantly increases the computational efficiency in the continuum flow regime compared with existing DVM solvers.
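As a concrete, one-dimensional illustration of reconstructing a distribution function from macroscopic variables instead of storing it, the sketch below evaluates a textbook discrete Maxwellian on a set of velocity nodes. It stands in only conceptually for the equilibrium-state estimation used in the paper; the function name and parameters are assumptions for this example.

import numpy as np

def discrete_maxwellian(rho, u, T, v_nodes, R=1.0):
    # Equilibrium distribution evaluated at the discrete velocity nodes
    # from density rho, macroscopic velocity u and temperature T (1D case).
    # Computing this on the fly means the full distribution function does
    # not have to be kept in memory for every cell and every velocity.
    return rho / np.sqrt(2.0 * np.pi * R * T) * np.exp(-(v_nodes - u) ** 2 / (2.0 * R * T))

# Example: 32 discrete velocities for a single cell.
v_nodes = np.linspace(-6.0, 6.0, 32)
g_eq = discrete_maxwellian(rho=1.0, u=0.3, T=1.0, v_nodes=v_nodes)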
LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads.
El-Metwally, Sara; Zakaria, Magdi; Hamza, Taher
2016-11-01
The deluge of current sequenced data has exceeded Moore's Law, more than doubling every 2 years since the next-generation sequencing (NGS) technologies were invented. Accordingly, we will be able to generate more and more data with high speed at fixed cost, but we lack the computational resources to store, process and analyze it. With error-prone high-throughput NGS reads and genomic repeats, the assembly graph contains a massive amount of redundant nodes and branching edges. Most assembly pipelines require this large graph to reside in memory to start their workflows, which is intractable for mammalian genomes. Resource-efficient genome assemblers combine both the power of advanced computing techniques and innovative data structures to encode the assembly graph efficiently in computer memory. LightAssembler is a lightweight assembly algorithm designed to be executed on a desktop machine. It uses a pair of cache oblivious Bloom filters, one holding a uniform sample of [Formula: see text]-spaced sequenced [Formula: see text]-mers and the other holding [Formula: see text]-mers classified as likely correct, using a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves comparable assembly accuracy and contiguity to other competing tools. Our method reduces the memory usage by [Formula: see text] compared to the resource-efficient assemblers using benchmark datasets from the GAGE and Assemblathon projects. While LightAssembler can be considered a gap-based sequence assembler, different gap sizes result in an almost constant assembly size and genome coverage. https://github.com/SaraEl-Metwally/LightAssembler CONTACT: sarah_almetwally4@mans.edu.eg Supplementary information: Supplementary data are available at Bioinformatics online.
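To illustrate the core data structure the assembler relies on (but not LightAssembler's cache-oblivious layout, spaced sampling, or statistical test), here is a minimal Bloom filter for k-mer membership queries; the filter size, hash choice, and helper names are arbitrary choices made for this sketch.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, kmer):
        # Derive n_hashes bit positions from salted BLAKE2b digests.
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(kmer.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, kmer):
        # May return false positives, never false negatives.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(kmer))

def kmers(read, k):
    # Enumerate the k-mers of a sequencing read.
    return (read[i:i + k] for i in range(len(read) - k + 1))

seen = BloomFilter()
for km in kmers("ACGTACGTAC", k=5):
    seen.add(km)
print("ACGTA" in seen)  # True: this 5-mer was inserted above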
Extending the Binomial Checkpointing Technique for Resilience
DOE Office of Scientific and Technical Information (OSTI.GOV)
Walther, Andrea; Narayanan, Sri Hari Krishna
In terms of computing time, adjoint methods offer a very attractive alternative to compute gradient information, required, e.g., for optimization purposes. However, together with this very favorable temporal complexity result comes a memory requirement that is in essence proportional to the operation count of the underlying function, e.g., if algorithmic differentiation is used to provide the adjoints. For this reason, checkpointing approaches in many variants have become popular. This paper analyzes an extension of the so-called binomial approach to also cover possible failures of the computing systems. Such a measure of precaution is of special interest for massive parallel simulations and adjoint calculations where the mean time between failures of the large-scale computing system is smaller than the time needed to complete the calculation of the adjoint information. We describe the extensions of standard checkpointing approaches required for such resilience, provide a corresponding implementation and discuss numerical results.
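For orientation, a stripped-down checkpoint-and-recompute adjoint sweep is sketched below. It uses evenly spaced checkpoints rather than the binomial (revolve-style) schedule analyzed in the paper, has no resilience logic, and the forward_step/adjoint_step callables and their calling convention are placeholders assumed for this example.

def reverse_with_checkpoints(forward_step, adjoint_step, state0, n_steps, snap_every):
    # Forward sweep: keep only sparse snapshots instead of every state.
    snapshots = {0: state0}
    state = state0
    for i in range(1, n_steps + 1):
        state = forward_step(state)
        if i % snap_every == 0 and i < n_steps:
            snapshots[i] = state
    # Reverse sweep: process one checkpointed segment at a time.
    boundaries = sorted(snapshots) + [n_steps]
    adjoint = None
    for seg_start, seg_end in reversed(list(zip(boundaries[:-1], boundaries[1:]))):
        # Recompute the intermediate states of this segment from its snapshot.
        states = [snapshots[seg_start]]
        for _ in range(seg_start, seg_end):
            states.append(forward_step(states[-1]))
        # Run the adjoint steps of the segment backwards; by convention here,
        # adjoint_step(state, None) seeds the adjoint at the final time.
        for s in reversed(states[:-1]):
            adjoint = adjoint_step(s, adjoint)
    return adjoint

# Example: the forward map is x -> a**n_steps * x, whose adjoint is a**n_steps.
a = 3.0
fwd = lambda s: a * s
adj = lambda s, lam: a * (1.0 if lam is None else lam)
print(reverse_with_checkpoints(fwd, adj, state0=2.0, n_steps=10, snap_every=3))  # 59049.0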
Flood inundation extent mapping based on block compressed tracing
NASA Astrophysics Data System (ADS)
Shen, Dingtao; Rui, Yikang; Wang, Jiechen; Zhang, Yu; Cheng, Liang
2015-07-01
Flood inundation extent, depth, and duration are important factors in flood hazard evaluation. At present, flood inundation analysis is based mainly on a seeded region-growing algorithm, which is an inefficient process because it requires excessive recursive computation and is incapable of processing massive datasets. To address this problem, we propose a block compressed tracing algorithm for mapping the flood inundation extent, which reads the DEM data in blocks before transferring them to raster compression storage. This allows a larger amount of data to be processed in a smaller amount of computer memory, overcoming the limitation of the regular seeded region-growing algorithm. In addition, the use of a raster boundary tracing technique allows the algorithm to avoid the time-consuming computations required by seeded region growing. Finally, we conduct a comparative evaluation in the Chin-sha River basin; the results show that the proposed method solves the problem of flood inundation extent mapping on massive DEM datasets with higher computational efficiency than the original method, making it suitable for practical applications.
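For context, the baseline that the block compressed tracing method improves upon can be sketched as a queue-based (non-recursive) seeded region growing over a DEM. The simple water-level threshold criterion, 4-connectivity, and names below are simplifying assumptions for illustration, not the paper's algorithm.

import numpy as np
from collections import deque

def flood_extent(dem, seed, water_level):
    # Breadth-first seeded region growing: flood every 4-connected cell
    # reachable from the seed whose elevation lies below the water level.
    # An explicit queue avoids the deep recursion of the classic
    # seeded region-growing formulation.
    rows, cols = dem.shape
    flooded = np.zeros((rows, cols), dtype=bool)
    if dem[seed] >= water_level:
        return flooded
    flooded[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not flooded[nr, nc] and dem[nr, nc] < water_level):
                flooded[nr, nc] = True
                queue.append((nr, nc))
    return flooded

# Example on a tiny synthetic DEM (elevations in arbitrary units).
dem = np.array([[5.0, 1.0, 5.0],
                [5.0, 1.5, 2.0],
                [5.0, 5.0, 5.0]])
print(flood_extent(dem, seed=(0, 1), water_level=3.0).astype(int))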