Chip architecture - A revolution brewing
NASA Astrophysics Data System (ADS)
Guterl, F.
1983-07-01
Techniques being explored by microchip designers and manufacturers to both speed up memory access and instruction execution while protecting memory are discussed. Attention is given to hardwiring control logic, pipelining for parallel processing, devising orthogonal instruction sets for interchangeable instruction fields, and the development of hardware for implementation of virtual memory and multiuser systems to provide memory management and protection. The inclusion of microcode in mainframes eliminated logic circuits that control timing and gating of the CPU. However, improvements in memory architecture have reduced access time to below that needed for instruction execution. Hardwiring the functions as a virtual memory enhances memory protection. Parallelism involves a redundant architecture, which allows identical operations to be performed simultaneously, and can be directed with microcode to avoid abortion of intermediate instructions once on set of instructions has been completed.
Rutger's CAM2000 chip architecture
NASA Technical Reports Server (NTRS)
Smith, Donald E.; Hall, J. Storrs; Miyake, Keith
1993-01-01
This report describes the architecture and instruction set of the Rutgers CAM2000 memory chip. The CAM2000 combines features of Associative Processing (AP), Content Addressable Memory (CAM), and Dynamic Random Access Memory (DRAM) in a single chip package that is not only DRAM compatible but capable of applying simple massively parallel operations to memory. This document reflects the current status of the CAM2000 architecture and is continually updated to reflect the current state of the architecture and instruction set.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arumugam, Kamesh
Efficient parallel implementations of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. This requires - exploiting the data parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of these applications employ irregular algorithms which exhibit data-dependent control-ow and irregular memory accesses. Furthermore,more » these applications are often iterative with dependency between steps, and thus making it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged particles beam dynamics is one such application where the distribution of work and memory access pattern at each time step is irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between different processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications have been focused around optimizing the irregular, data-dependent memory accesses and control-ow during a single step of the application independent of the other steps, with the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-ow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps. In this dissertation, we present novel machine learning based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-ow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged particles beam dynamics simulation as a motivating example throughout the dissertation to present our new approach, though they should be equally applicable to a wide range of irregular applications. The machine learning approach presented here use predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation to anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution which improves the performance of the application at a future time step based on the observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver a good aggregate performance. We used these optimization techniques and anticipation strategy to design a cache-aware, memory efficient parallel algorithm to address the irregularities in the parallel implementation of charged particles beam dynamics simulation on different HPC architectures. Experimental result using a diverse mix of HPC architectures shows that our approach in using anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and in improving resource utilization.« less
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
Carter Edwards, H.; Trott, Christian R.; Sunderland, Daniel
2014-07-22
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diversemore » manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.« less
Optical memories in digital computing
NASA Technical Reports Server (NTRS)
Alford, C. O.; Gaylord, T. K.
1979-01-01
High capacity optical memories with relatively-high data-transfer rate and multiport simultaneous access capability may serve as basis for new computer architectures. Several computer structures that might profitably use memories are: a) simultaneous record-access system, b) simultaneously-shared memory computer system, and c) parallel digital processing structure.
NASA Technical Reports Server (NTRS)
Katti, Romney R.
1995-01-01
Random-access memory (RAM) devices of proposed type exploit magneto-optical properties of magnetic garnets exhibiting perpendicular anisotropy. Magnetic writing and optical readout used. Provides nonvolatile storage and resists damage by ionizing radiation. Because of basic architecture and pinout requirements, most likely useful as small-capacity memory devices.
2010-07-22
dependent , providing a natural bandwidth match between compute cores and the memory subsystem. • High Bandwidth Dcnsity. Waveguides crossing the chip...simulate this memory access architecture on a 2S6-core chip with a concentrated 64-node network lIsing detailed traces of high-performance embedded...memory modulcs, wc placc memory access poi nts (MAPs) around the pcriphery of the chip connected to thc nctwork. These MAPs, shown in Figure 4, contain
NASA Astrophysics Data System (ADS)
Kajiyama, Shinya; Fujito, Masamichi; Kasai, Hideo; Mizuno, Makoto; Yamaguchi, Takanori; Shinagawa, Yutaka
A novel 300MHz embedded flash memory for dual-core microcontrollers with a shared ROM architecture is proposed. One of its features is a three-stage pipeline read operation, which enables reduced access pitch and therefore reduces performance penalty due to conflict of shared ROM accesses. Another feature is a highly sensitive sense amplifier that achieves efficient pipeline operation with two-cycle latency one-cycle pitch as a result of a shortened sense time of 0.63ns. The combination of the pipeline architecture and proposed sense amplifiers significantly reduces access-conflict penalties with shared ROM and enhances performance of 32-bit RISC dual-core microcontrollers by 30%.
Multi-wavelength access gate for WDM-formatted words in optical RAM row architectures
NASA Astrophysics Data System (ADS)
Fitsios, D.; Alexoudi, T.; Vagionas, C.; Miliou, A.; Kanellos, G. T.; Pleros, N.
2013-03-01
Optical RAM has emerged as a promising solution for overcoming the "Memory Wall" of electronics, indicating the use of light in RAM architectures as the approach towards enabling ps-regime memory access times. Taking a step further towards exploiting the unique wavelength properties of optical signals, we reveal new architectural perspectives in optical RAM structures by introducing WDM principles in the storage area. To this end, we demonstrate a novel SOAbased multi-wavelength Access Gate for utilization in a 4x4 WDM optical RAM bank architecture. The proposed multiwavelength Access Gate can simultaneously control random access to a 4-bit optical word, exploiting Cross-Gain-Modulation (XGM) to process 8 Bit and Bit channels encoded in 8 different wavelengths. It also suggests simpler optical RAM row architectures, allowing for the effective sharing of one multi-wavelength Access Gate for each row, substituting the eight AGs in the case of conventional optical RAM architectures. The scheme is shown to support 10Gbit/s operation for the incoming 4-bit data streams, with a power consumption of 15mW/Gbit/s. All 8 wavelength channels demonstrate error-free operation with a power penalty lower than 3 dB for all channels, compared to Back-to-Back measurements. The proposed optical RAM architecture reveals that exploiting the WDM capabilities of optical components can lead to RAM bank implementations with smarter column/row encoders/decoders, increased circuit simplicity, reduced number of active elements and associated power consumption. Moreover, exploitation of the wavelength entity can release significant potential towards reconfigurable optical cache mapping schemes when using the wavelength dimension for memory addressing.
NASA Astrophysics Data System (ADS)
Liu, Chen; Han, Runze; Zhou, Zheng; Huang, Peng; Liu, Lifeng; Liu, Xiaoyan; Kang, Jinfeng
2018-04-01
In this work we present a novel convolution computing architecture based on metal oxide resistive random access memory (RRAM) to process the image data stored in the RRAM arrays. The proposed image storage architecture shows performances of better speed-device consumption efficiency compared with the previous kernel storage architecture. Further we improve the architecture for a high accuracy and low power computing by utilizing the binary storage and the series resistor. For a 28 × 28 image and 10 kernels with a size of 3 × 3, compared with the previous kernel storage approach, the newly proposed architecture shows excellent performances including: 1) almost 100% accuracy within 20% LRS variation and 90% HRS variation; 2) more than 67 times speed boost; 3) 71.4% energy saving.
NASA Astrophysics Data System (ADS)
Liu, Chunsen; Yan, Xiao; Song, Xiongfei; Ding, Shijin; Zhang, David Wei; Zhou, Peng
2018-05-01
As conventional circuits based on field-effect transistors are approaching their physical limits due to quantum phenomena, semi-floating gate transistors have emerged as an alternative ultrafast and silicon-compatible technology. Here, we show a quasi-non-volatile memory featuring a semi-floating gate architecture with band-engineered van der Waals heterostructures. This two-dimensional semi-floating gate memory demonstrates 156 times longer refresh time with respect to that of dynamic random access memory and ultrahigh-speed writing operations on nanosecond timescales. The semi-floating gate architecture greatly enhances the writing operation performance and is approximately 106 times faster than other memories based on two-dimensional materials. The demonstrated characteristics suggest that the quasi-non-volatile memory has the potential to bridge the gap between volatile and non-volatile memory technologies and decrease the power consumption required for frequent refresh operations, enabling a high-speed and low-power random access memory.
Distributed multiport memory architecture
NASA Technical Reports Server (NTRS)
Kohl, W. H. (Inventor)
1983-01-01
A multiport memory architecture is diclosed for each of a plurality of task centers connected to a command and data bus. Each task center, includes a memory and a plurality of devices which request direct memory access as needed. The memory includes an internal data bus and an internal address bus to which the devices are connected, and direct timing and control logic comprised of a 10-state ring counter for allocating memory devices by enabling AND gates connected to the request signal lines of the devices. The outputs of AND gates connected to the same device are combined by OR gates to form an acknowledgement signal that enables the devices to address the memory during the next clock period. The length of the ring counter may be effectively lengthened to any multiple of ten to allow for more direct memory access intervals in one repetitive sequence. One device is a network bus adapter which serially shifts onto the command and data bus, a data word (8 bits plus control and parity bits) during the next ten direct memory access intervals after it has been granted access. The NBA is therefore allocated only one access in every ten intervals, which is a predetermined interval for all centers. The ring counters of all centers are periodically synchronized by DMA SYNC signal to assure that all NBAs be able to function in synchronism for data transfer from one center to another.
The potential of multi-port optical memories in digital computing
NASA Technical Reports Server (NTRS)
Alford, C. O.; Gaylord, T. K.
1975-01-01
A high-capacity memory with a relatively high data transfer rate and multi-port simultaneous access capability may serve as the basis for new computer architectures. The implementation of a multi-port optical memory is discussed. Several computer structures are presented that might profitably use such a memory. These structures include (1) a simultaneous record access system, (2) a simultaneously shared memory computer system, and (3) a parallel digital processing structure.
Giovannetti, Vittorio; Lloyd, Seth; Maccone, Lorenzo
2008-04-25
A random access memory (RAM) uses n bits to randomly address N=2(n) distinct memory cells. A quantum random access memory (QRAM) uses n qubits to address any quantum superposition of N memory cells. We present an architecture that exponentially reduces the requirements for a memory call: O(logN) switches need be thrown instead of the N used in conventional (classical or quantum) RAM designs. This yields a more robust QRAM algorithm, as it in general requires entanglement among exponentially less gates, and leads to an exponential decrease in the power needed for addressing. A quantum optical implementation is presented.
Designing a VMEbus FDDI adapter card
NASA Astrophysics Data System (ADS)
Venkataraman, Raman
1992-03-01
This paper presents a system architecture for a VMEbus FDDI adapter card containing a node core, FDDI block, frame buffer memory and system interface unit. Most of the functions of the PHY and MAC layers of FDDI are implemented with National's FDDI chip set and the SMT implementation is simplified with a low cost microcontroller. The factors that influence the system bus bandwidth utilization and FDDI bandwidth utilization are the data path and frame buffer memory architecture. The VRAM based frame buffer memory has two sections - - LLC frame memory and SMT frame memory. Each section with an independent serial access memory (SAM) port provides an independent access after the initial data transfer cycle on the main port and hence, the throughput is maximized on each port of the memory. The SAM port simplifies the system bus master DMA design and the VMEbus interface can be designed with low-cost off-the-shelf interface chips.
The Effect of NUMA Tunings on CPU Performance
NASA Astrophysics Data System (ADS)
Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr
2015-12-01
Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to other CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark, and ATLAS software.
Supporting shared data structures on distributed memory architectures
NASA Technical Reports Server (NTRS)
Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John
1990-01-01
Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the NCUBE/7 and IPSC/2 hypercubes are described.
Kinetic Inductance Memory Cell and Architecture for Superconducting Computers
NASA Astrophysics Data System (ADS)
Chen, George J.
Josephson memory devices typically use a superconducting loop containing one or more Josephson junctions to store information. The magnetic inductance of the loop in conjunction with the Josephson junctions provides multiple states to store data. This thesis shows that replacing the magnetic inductor in a memory cell with a kinetic inductor can lead to a smaller cell size. However, magnetic control of the cells is lost. Thus, a current-injection based architecture for a memory array has been designed to work around this problem. The isolation between memory cells that magnetic control provides is provided through resistors in this new architecture. However, these resistors allow leakage current to flow which ultimately limits the size of the array due to power considerations. A kinetic inductance memory array will be limited to 4K bits with a read access time of 320 ps for a 1 um linewidth technology. If a power decoder could be developed, the memory architecture could serve as the blueprint for a fast (<1 ns), large scale (>1 Mbit) superconducting memory array.
NASA Technical Reports Server (NTRS)
Rogers, David
1988-01-01
The advent of the Connection Machine profoundly changes the world of supercomputers. The highly nontraditional architecture makes possible the exploration of algorithms that were impractical for standard Von Neumann architectures. Sparse distributed memory (SDM) is an example of such an algorithm. Sparse distributed memory is a particularly simple and elegant formulation for an associative memory. The foundations for sparse distributed memory are described, and some simple examples of using the memory are presented. The relationship of sparse distributed memory to three important computational systems is shown: random-access memory, neural networks, and the cerebellum of the brain. Finally, the implementation of the algorithm for sparse distributed memory on the Connection Machine is discussed.
Optoelectronic-cache memory system architecture.
Chiarulli, D M; Levitan, S P
1996-05-10
We present an investigation of the architecture of an optoelectronic cache that can integrate terabit optical memories with the electronic caches associated with high-performance uniprocessors and multiprocessors. The use of optoelectronic-cache memories enables these terabit technologies to provide transparently low-latency secondary memory with frame sizes comparable with disk pages but with latencies that approach those of electronic secondary-cache memories. This enables the implementation of terabit memories with effective access times comparable with the cycle times of current microprocessors. The cache design is based on the use of a smart-pixel array and combines parallel free-space optical input-output to-and-from optical memory with conventional electronic communication to the processor caches. This cache and the optical memory system to which it will interface provide a large random-access memory space that has a lower overall latency than that of magnetic disks and disk arrays. In addition, as a consequence of the high-bandwidth parallel input-output capabilities of optical memories, fault service times for the optoelectronic cache are substantially less than those currently achievable with any rotational media.
NASA Astrophysics Data System (ADS)
Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying
2017-05-01
In the Heterogeneous multi-core architecture, CPU and GPU processor are integrated on the same chip, which poses a new challenge to the last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can tolerate increase in memory access latency when there is sufficient thread-level parallelism. Taking into account the GPU program memory latency tolerance characteristics, this paper presents a method that let GPU applications can access to memory directly, leaving lots of LLC space for CPU applications, in improving the performance of CPU applications and does not affect the performance of GPU applications. When the CPU application is cache sensitive, and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.
Efficient Graph Based Assembly of Short-Read Sequences on Hybrid Core Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sczyrba, Alex; Pratap, Abhishek; Canon, Shane
2011-03-22
Advanced architectures can deliver dramatically increased throughput for genomics and proteomics applications, reducing time-to-completion in some cases from days to minutes. One such architecture, hybrid-core computing, marries a traditional x86 environment with a reconfigurable coprocessor, based on field programmable gate array (FPGA) technology. In addition to higher throughput, increased performance can fundamentally improve research quality by allowing more accurate, previously impractical approaches. We will discuss the approach used by Convey?s de Bruijn graph constructor for short-read, de-novo assembly. Bioinformatics applications that have random access patterns to large memory spaces, such as graph-based algorithms, experience memory performance limitations on cache-based x86more » servers. Convey?s highly parallel memory subsystem allows application-specific logic to simultaneously access 8192 individual words in memory, significantly increasing effective memory bandwidth over cache-based memory systems. Many algorithms, such as Velvet and other de Bruijn graph based, short-read, de-novo assemblers, can greatly benefit from this type of memory architecture. Furthermore, small data type operations (four nucleotides can be represented in two bits) make more efficient use of logic gates than the data types dictated by conventional programming models.JGI is comparing the performance of Convey?s graph constructor and Velvet on both synthetic and real data. We will present preliminary results on memory usage and run time metrics for various data sets with different sizes, from small microbial and fungal genomes to very large cow rumen metagenome. For genomes with references we will also present assembly quality comparisons between the two assemblers.« less
Scalable Motion Estimation Processor Core for Multimedia System-on-Chip Applications
NASA Astrophysics Data System (ADS)
Lai, Yeong-Kang; Hsieh, Tian-En; Chen, Lien-Fei
2007-04-01
In this paper, we describe a high-throughput and scalable motion estimation processor architecture for multimedia system-on-chip applications. The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. Using the PE rings efficiently and an intelligent memory-interleaving organization, the efficiency of the architecture can be increased. Moreover, using efficient on-chip memories and a data management technique can effectively decrease the power consumption and memory bandwidth. Techniques for reducing the number of interconnections and external memory accesses are also presented. Our results demonstrate that the proposed scalable PE-ringed architecture is a flexible and high-performance processor core in multimedia system-on-chip applications.
Architectural design and simulation of a virtual memory
NASA Technical Reports Server (NTRS)
Kwok, G.; Chu, Y.
1971-01-01
Virtual memory is an imaginary main memory with a very large capacity which the programmer has at his disposal. It greatly contributes to the solution of the dynamic storage allocation problem. The architectural design of a virtual memory is presented which implements by hardware the idea of queuing and scheduling the page requests to a paging drum in such a way that the access of the paging drum is increased many times. With the design, an increase of up to 16 times in page transfer rate is achievable when the virtual memory is heavily loaded. This in turn makes feasible a great increase in the system throughput.
Carbon nanomaterials for non-volatile memories
NASA Astrophysics Data System (ADS)
Ahn, Ethan C.; Wong, H.-S. Philip; Pop, Eric
2018-03-01
Carbon can create various low-dimensional nanostructures with remarkable electronic, optical, mechanical and thermal properties. These features make carbon nanomaterials especially interesting for next-generation memory and storage devices, such as resistive random access memory, phase-change memory, spin-transfer-torque magnetic random access memory and ferroelectric random access memory. Non-volatile memories greatly benefit from the use of carbon nanomaterials in terms of bit density and energy efficiency. In this Review, we discuss sp2-hybridized carbon-based low-dimensional nanostructures, such as fullerene, carbon nanotubes and graphene, in the context of non-volatile memory devices and architectures. Applications of carbon nanomaterials as memory electrodes, interfacial engineering layers, resistive-switching media, and scalable, high-performance memory selectors are investigated. Finally, we compare the different memory technologies in terms of writing energy and time, and highlight major challenges in the manufacturing, integration and understanding of the physical mechanisms and material properties.
Multiprocessor architecture: Synthesis and evaluation
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1990-01-01
Multiprocessor computed architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application specific, architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty is analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward the removal of the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Gaeke, Brian R.; Husbands, Parry; Li, Xiaoye S.; Oliker, Leonid; Yelick, Katherine A.; Biegel, Bryan (Technical Monitor)
2002-01-01
The increasing gap between processor and memory performance has lead to new architectural models for memory-intensive applications. In this paper, we explore the performance of a set of memory-intensive benchmarks and use them to compare the performance of conventional cache-based microprocessors to a mixed logic and DRAM processor called VIRAM. The benchmarks are based on problem statements, rather than specific implementations, and in each case we explore the fundamental hardware requirements of the problem, as well as alternative algorithms and data structures that can help expose fine-grained parallelism or simplify memory access patterns. The benchmarks are characterized by their memory access patterns, their basic control structures, and the ratio of computation to memory operation.
A performance comparison of the IBM RS/6000 and the Astronautics ZS-1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, W.M.; Abraham, S.G.; Davidson, E.S.
1991-01-01
Concurrent uniprocessor architectures, of which vector and superscalar are two examples, are designed to capitalize on fine-grain parallelism. The authors have developed a performance evaluation method for comparing and improving these architectures, and in this article they present the methodology and a detailed case study of two machines. The runtime of many programs is dominated by time spent in loop constructs - for example, Fortran Do-loops. Loops generally comprise two logical processes: The access process generates addresses for memory operations while the execute process operates on floating-point data. Memory access patterns typically can be generated independently of the data inmore » the execute process. This independence allows the access process to slip ahead, thereby hiding memory latency. The IBM 360/91 was designed in 1967 to achieve slip dynamically, at runtime. One CPU unit executes integer operations while another handles floating-point operations. Other machines, including the VAX 9000 and the IBM RS/6000, use a similar approach.« less
Aspects of GPU perfomance in algorithms with random memory access
NASA Astrophysics Data System (ADS)
Kashkovsky, Alexander V.; Shershnev, Anton A.; Vashchenkov, Pavel V.
2017-10-01
The numerical code for solving the Boltzmann equation on the hybrid computational cluster using the Direct Simulation Monte Carlo (DSMC) method showed that on Tesla K40 accelerators computational performance drops dramatically with increase of percentage of occupied GPU memory. Testing revealed that memory access time increases tens of times after certain critical percentage of memory is occupied. Moreover, it seems to be the common problem of all NVidia's GPUs arising from its architecture. Few modifications of the numerical algorithm were suggested to overcome this problem. One of them, based on the splitting the memory into "virtual" blocks, resulted in 2.5 times speed up.
Bubble memory module for spacecraft application
NASA Technical Reports Server (NTRS)
Hayes, P. J.; Looney, K. T.; Nichols, C. D.
1985-01-01
Bubble domain technology offers an all-solid-state alternative for data storage in onboard data systems. A versatile modular bubble memory concept was developed. The key module is the bubble memory module which contains all of the storage devices and circuitry for accessing these devices. This report documents the bubble memory module design and preliminary hardware designs aimed at memory module functional demonstration with available commercial bubble devices. The system architecture provides simultaneous operation of bubble devices to attain high data rates. Banks of bubble devices are accessed by a given bubble controller to minimize controller parts. A power strobing technique is discussed which could minimize the average system power dissipation. A fast initialization method using EEPROM (electrically erasable, programmable read-only memory) devices promotes fast access. Noise and crosstalk problems and implementations to minimize these are discussed. Flight memory systems which incorporate the concepts and techniques of this work could now be developed for applications.
Wide-Range Motion Estimation Architecture with Dual Search Windows for High Resolution Video Coding
NASA Astrophysics Data System (ADS)
Dung, Lan-Rong; Lin, Meng-Chun
This paper presents a memory-efficient motion estimation (ME) technique for high-resolution video compression. The main objective is to reduce the external memory access, especially for limited local memory resource. The reduction of memory access can successfully save the notorious power consumption. The key to reduce the memory accesses is based on center-biased algorithm in that the center-biased algorithm performs the motion vector (MV) searching with the minimum search data. While considering the data reusability, the proposed dual-search-windowing (DSW) approaches use the secondary windowing as an option per searching necessity. By doing so, the loading of search windows can be alleviated and hence reduce the required external memory bandwidth. The proposed techniques can save up to 81% of external memory bandwidth and require only 135 MBytes/sec, while the quality degradation is less than 0.2dB for 720p HDTV clips coded at 8Mbits/sec.
Non-volatile memory based on the ferroelectric photovoltaic effect
Guo, Rui; You, Lu; Zhou, Yang; Shiuh Lim, Zhi; Zou, Xi; Chen, Lang; Ramesh, R.; Wang, Junling
2013-01-01
The quest for a solid state universal memory with high-storage density, high read/write speed, random access and non-volatility has triggered intense research into new materials and novel device architectures. Though the non-volatile memory market is dominated by flash memory now, it has very low operation speed with ~10 μs programming and ~10 ms erasing time. Furthermore, it can only withstand ~105 rewriting cycles, which prevents it from becoming the universal memory. Here we demonstrate that the significant photovoltaic effect of a ferroelectric material, such as BiFeO3 with a band gap in the visible range, can be used to sense the polarization direction non-destructively in a ferroelectric memory. A prototype 16-cell memory based on the cross-bar architecture has been prepared and tested, demonstrating the feasibility of this technique. PMID:23756366
NASA Technical Reports Server (NTRS)
Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.
2013-01-01
The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance-load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTB architecture wraps access to data with thread management to move threads to the home processor for that data so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
Integrating Software Modules For Robot Control
NASA Technical Reports Server (NTRS)
Volpe, Richard A.; Khosla, Pradeep; Stewart, David B.
1993-01-01
Reconfigurable, sensor-based control system uses state variables in systematic integration of reusable control modules. Designed for open-architecture hardware including many general-purpose microprocessors, each having own local memory plus access to global shared memory. Implemented in software as extension of Chimera II real-time operating system. Provides transparent computing mechanism for intertask communication between control modules and generic process-module architecture for multiprocessor realtime computation. Used to control robot arm. Proves useful in variety of other control and robotic applications.
OS friendly microprocessor architecture: Hardware level computer security
NASA Astrophysics Data System (ADS)
Jungwirth, Patrick; La Fratta, Patrick
2016-05-01
We present an introduction to the patented OS Friendly Microprocessor Architecture (OSFA) and hardware level computer security. Conventional microprocessors have not tried to balance hardware performance and OS performance at the same time. Conventional microprocessors have depended on the Operating System for computer security and information assurance. The goal of the OS Friendly Architecture is to provide a high performance and secure microprocessor and OS system. We are interested in cyber security, information technology (IT), and SCADA control professionals reviewing the hardware level security features. The OS Friendly Architecture is a switched set of cache memory banks in a pipeline configuration. For light-weight threads, the memory pipeline configuration provides near instantaneous context switching times. The pipelining and parallelism provided by the cache memory pipeline provides for background cache read and write operations while the microprocessor's execution pipeline is running instructions. The cache bank selection controllers provide arbitration to prevent the memory pipeline and microprocessor's execution pipeline from accessing the same cache bank at the same time. This separation allows the cache memory pages to transfer to and from level 1 (L1) caching while the microprocessor pipeline is executing instructions. Computer security operations are implemented in hardware. By extending Unix file permissions bits to each cache memory bank and memory address, the OSFA provides hardware level computer security.
NASA Technical Reports Server (NTRS)
Harper, Richard E.; Butler, Bryan P.
1990-01-01
The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.
Vortex-Core Reversal Dynamics: Towards Vortex Random Access Memory
NASA Astrophysics Data System (ADS)
Kim, Sang-Koog
2011-03-01
An energy-efficient, ultrahigh-density, ultrafast, and nonvolatile solid-state universal memory is a long-held dream in the field of information-storage technology. The magnetic random access memory (MRAM) along with a spin-transfer-torque switching mechanism is a strong candidate-means of realizing that dream, given its nonvolatility, infinite endurance, and fast random access. Magnetic vortices in patterned soft magnetic dots promise ground-breaking applications in information-storage devices, owing to the very stable twofold ground states of either their upward or downward core magnetization orientation and plausible core switching by in-plane alternating magnetic fields or spin-polarized currents. However, two technologically most important but very challenging issues --- low-power recording and reliable selection of each memory cell with already existing cross-point architectures --- have not yet been resolved for the basic operations in information storage, that is, writing (recording) and readout. Here, we experimentally demonstrate a magnetic vortex random access memory (VRAM) in the basic cross-point architecture. This unique VRAM offers reliable cell selection and low-power-consumption control of switching of out-of-plane core magnetizations using specially designed rotating magnetic fields generated by two orthogonal and unipolar Gaussian-pulse currents along with optimized pulse width and time delay. Our achievement of a new device based on a new material, that is, a medium composed of patterned vortex-state disks, together with the new physics on ultrafast vortex-core switching dynamics, can stimulate further fruitful research on MRAMs that are based on vortex-state dot arrays.
Shared Memory Parallelization of an Implicit ADI-type CFD Code
NASA Technical Reports Server (NTRS)
Hauser, Th.; Huang, P. G.
1999-01-01
A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.
Edla, Damodar Reddy; Kuppili, Venkatanareshbabu; Dharavath, Ramesh; Beechu, Nareshkumar Reddy
2017-01-01
Low-power wearable devices for disease diagnosis are used at anytime and anywhere. These are non-invasive and pain-free for the better quality of life. However, these devices are resource constrained in terms of memory and processing capability. Memory constraint allows these devices to store a limited number of patterns and processing constraint provides delayed response. It is a challenging task to design a robust classification system under above constraints with high accuracy. In this Letter, to resolve this problem, a novel architecture for weightless neural networks (WNNs) has been proposed. It uses variable sized random access memories to optimise the memory usage and a modified binary TRIE data structure for reducing the test time. In addition, a bio-inspired-based genetic algorithm has been employed to improve the accuracy. The proposed architecture is experimented on various disease datasets using its software and hardware realisations. The experimental results prove that the proposed architecture achieves better performance in terms of accuracy, memory saving and test time as compared to standard WNNs. It also outperforms in terms of accuracy as compared to conventional neural network-based classifiers. The proposed architecture is a powerful part of most of the low-power wearable devices for the solution of memory, accuracy and time issues. PMID:28868148
Memory access in shared virtual memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berrendorf, R.
1992-01-01
Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
Memory access in shared virtual memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berrendorf, R.
1992-09-01
Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
NASA Technical Reports Server (NTRS)
Morfopoulos, Arin C.; Pham, Thang D.
2013-01-01
JPL has produced a series of FPGA (field programmable gate array) vision algorithms that were written with custom interfaces to get data in and out of each vision module. Each module has unique requirements on the data interface, and further vision modules are continually being developed, each with their own custom interfaces. Each memory module had also been designed for direct access to memory or to another memory module.
A Cerebellar-model Associative Memory as a Generalized Random-access Memory
NASA Technical Reports Server (NTRS)
Kanerva, Pentti
1989-01-01
A versatile neural-net model is explained in terms familiar to computer scientists and engineers. It is called the sparse distributed memory, and it is a random-access memory for very long words (for patterns with thousands of bits). Its potential utility is the result of several factors: (1) a large pattern representing an object or a scene or a moment can encode a large amount of information about what it represents; (2) this information can serve as an address to the memory, and it can also serve as data; (3) the memory is noise tolerant--the information need not be exact; (4) the memory can be made arbitrarily large and hence an arbitrary amount of information can be stored in it; and (5) the architecture is inherently parallel, allowing large memories to be fast. Such memories can become important components of future computers.
DOE Office of Scientific and Technical Information (OSTI.GOV)
De Supinski, B.; Caliga, D.
2017-09-28
The primary objective of this project was to develop memory optimization technology to efficiently deliver data to, and distribute data within, the SRC-6's Field Programmable Gate Array- ("FPGA") based Multi-Adaptive Processors (MAPs). The hardware/software approach was to explore efficient MAP configurations and generate the compiler technology to exploit those configurations. This memory accessing technology represents an important step towards making reconfigurable symmetric multi-processor (SMP) architectures that will be a costeffective solution for large-scale scientific computing.
PIMS: Memristor-Based Processing-in-Memory-and-Storage.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cook, Jeanine
Continued progress in computing has augmented the quest for higher performance with a new quest for higher energy efficiency. This has led to the re-emergence of Processing-In-Memory (PIM) ar- chitectures that offer higher density and performance with some boost in energy efficiency. Past PIM work either integrated a standard CPU with a conventional DRAM to improve the CPU- memory link, or used a bit-level processor with Single Instruction Multiple Data (SIMD) control, but neither matched the energy consumption of the memory to the computation. We originally proposed to develop a new architecture derived from PIM that more effectively addressed energymore » efficiency for high performance scientific, data analytics, and neuromorphic applications. We also originally planned to implement a von Neumann architecture with arithmetic/logic units (ALUs) that matched the power consumption of an advanced storage array to maximize energy efficiency. Implementing this architecture in storage was our original idea, since by augmenting storage (in- stead of memory), the system could address both in-memory computation and applications that accessed larger data sets directly from storage, hence Processing-in-Memory-and-Storage (PIMS). However, as our research matured, we discovered several things that changed our original direc- tion, the most important being that a PIM that implements a standard von Neumann-type archi- tecture results in significant energy efficiency improvement, but only about a O(10) performance improvement. In addition to this, the emergence of new memory technologies moved us to propos- ing a non-von Neumann architecture, called Superstrider, implemented not in storage, but in a new DRAM technology called High Bandwidth Memory (HBM). HBM is a stacked DRAM tech- nology that includes a logic layer where an architecture such as Superstrider could potentially be implemented.« less
Respecting Relations: Memory Access and Antecedent Retrieval in Incremental Sentence Processing
ERIC Educational Resources Information Center
Kush, Dave W.
2013-01-01
This dissertation uses the processing of anaphoric relations to probe how linguistic information is encoded in and retrieved from memory during real-time sentence comprehension. More specifically, the dissertation attempts to resolve a tension between the demands of a linguistic processor implemented in a general-purpose cognitive architecture and…
Kiefer, Gundolf; Lehmann, Helko; Weese, Jürgen
2006-04-01
Maximum intensity projections (MIPs) are an important visualization technique for angiographic data sets. Efficient data inspection requires frame rates of at least five frames per second at preserved image quality. Despite the advances in computer technology, this task remains a challenge. On the one hand, the sizes of computed tomography and magnetic resonance images are increasing rapidly. On the other hand, rendering algorithms do not automatically benefit from the advances in processor technology, especially for large data sets. This is due to the faster evolving processing power and the slower evolving memory access speed, which is bridged by hierarchical cache memory architectures. In this paper, we investigate memory access optimization methods and use them for generating MIPs on general-purpose central processing units (CPUs) and graphics processing units (GPUs), respectively. These methods can work on any level of the memory hierarchy, and we show that properly combined methods can optimize memory access on multiple levels of the hierarchy at the same time. We present performance measurements to compare different algorithm variants and illustrate the influence of the respective techniques. On current hardware, the efficient handling of the memory hierarchy for CPUs improves the rendering performance by a factor of 3 to 4. On GPUs, we observed that the effect is even larger, especially for large data sets. The methods can easily be adjusted to different hardware specifics, although their impact can vary considerably. They can also be used for other rendering techniques than MIPs, and their use for more general image processing task could be investigated in the future.
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry
1998-01-01
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
Ferroelectric tunneling element and memory applications which utilize the tunneling element
Kalinin, Sergei V [Knoxville, TN; Christen, Hans M [Knoxville, TN; Baddorf, Arthur P [Knoxville, TN; Meunier, Vincent [Knoxville, TN; Lee, Ho Nyung [Oak Ridge, TN
2010-07-20
A tunneling element includes a thin film layer of ferroelectric material and a pair of dissimilar electrically-conductive layers disposed on opposite sides of the ferroelectric layer. Because of the dissimilarity in composition or construction between the electrically-conductive layers, the electron transport behavior of the electrically-conductive layers is polarization dependent when the tunneling element is below the Curie temperature of the layer of ferroelectric material. The element can be used as a basis of compact 1R type non-volatile random access memory (RAM). The advantages include extremely simple architecture, ultimate scalability and fast access times generic for all ferroelectric memories.
Non-volatile magnetic random access memory
NASA Technical Reports Server (NTRS)
Katti, Romney R. (Inventor); Stadler, Henry L. (Inventor); Wu, Jiin-Chuan (Inventor)
1994-01-01
Improvements are made in a non-volatile magnetic random access memory. Such a memory is comprised of an array of unit cells, each having a Hall-effect sensor and a thin-film magnetic element made of material having an in-plane, uniaxial anisotropy and in-plane, bipolar remanent magnetization states. The Hall-effect sensor is made more sensitive by using a 1 m thick molecular beam epitaxy grown InAs layer on a silicon substrate by employing a GaAs/AlGaAs/InAlAs superlattice buffering layer. One improvement avoids current shunting problems of matrix architecture. Another improvement reduces the required magnetizing current for the micromagnets. Another improvement relates to the use of GaAs technology wherein high electron-mobility GaAs MESFETs provide faster switching times. Still another improvement relates to a method for configuring the invention as a three-dimensional random access memory.
Ultra-High Density Holographic Memory Module with Solid-State Architecture
NASA Technical Reports Server (NTRS)
Markov, Vladimir B.
2000-01-01
NASA's terrestrial. space, and deep-space missions require technology that allows storing. retrieving, and processing a large volume of information. Holographic memory offers high-density data storage with parallel access and high throughput. Several methods exist for data multiplexing based on the fundamental principles of volume hologram selectivity. We recently demonstrated that a spatial (amplitude-phase) encoding of the reference wave (SERW) looks promising as a way to increase the storage density. The SERW hologram offers a method other than traditional methods of selectivity, such as spatial de-correlation between recorded and reconstruction fields, In this report we present the experimental results of the SERW-hologram memory module with solid-state architecture, which is of particular interest for space operations.
A Bandwidth-Optimized Multi-Core Architecture for Irregular Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Secchi, Simone; Tumeo, Antonino; Villa, Oreste
This paper presents an architecture template for next-generation high performance computing systems specifically targeted to irregular applications. We start our work by considering that future generation interconnection and memory bandwidth full-system numbers are expected to grow by a factor of 10. In order to keep up with such a communication capacity, while still resorting to fine-grained multithreading as the main way to tolerate unpredictable memory access latencies of irregular applications, we show how overall performance scaling can benefit from the multi-core paradigm. At the same time, we also show how such an architecture template must be coupled with specific techniquesmore » in order to optimize bandwidth utilization and achieve the maximum scalability. We propose a technique based on memory references aggregation, together with the related hardware implementation, as one of such optimization techniques. We explore the proposed architecture template by focusing on the Cray XMT architecture and, using a dedicated simulation infrastructure, validate the performance of our template with two typical irregular applications. Our experimental results prove the benefits provided by both the multi-core approach and the bandwidth optimization reference aggregation technique.« less
Wang, Kang; Gu, Huaxi; Yang, Yintang; Wang, Kun
2015-08-10
With the number of cores increasing, there is an emerging need for a high-bandwidth low-latency interconnection network, serving core-to-memory communication. In this paper, aiming at the goal of simultaneous access to multi-rank memory, we propose an optical interconnection network for core-to-memory communication. In the proposed network, the wavelength usage is delicately arranged so that cores can communicate with different ranks at the same time and broadcast for flow control can be achieved. A distributed memory controller architecture that works in a pipeline mode is also designed for efficient optical communication and transaction address processes. The scaling method and wavelength assignment for the proposed network are investigated. Compared with traditional electronic bus-based core-to-memory communication, the simulation results based on the PARSEC benchmark show that the bandwidth enhancement and latency reduction are apparent.
Integrated Vertical Bloch Line (VBL) memory
NASA Technical Reports Server (NTRS)
Katti, R. R.; Wu, J. C.; Stadler, H. L.
1991-01-01
Vertical Bloch Line (VBL) Memory is a recently conceived, integrated, solid state, block access, VLSI memory which offers the potential of 1 Gbit/sq cm areal storage density, data rates of hundreds of megabits/sec, and submillisecond average access time simultaneously at relatively low mass, volume, and power values when compared to alternative technologies. VBLs are micromagnetic structures within magnetic domain walls which can be manipulated using magnetic fields from integrated conductors. The presence or absence of BVL pairs are used to store binary information. At present, efforts are being directed at developing a single chip memory using 25 Mbit/sq cm technology in magnetic garnet material which integrates, at a single operating point, the writing, storage, reading, and amplification functions needed in a memory. The current design architecture, functional elements, and supercomputer simulation results are described which are used to assist the design process.
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.
2011-01-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W
2012-06-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics
NASA Astrophysics Data System (ADS)
Bard, Christopher M.; Dorelli, John C.
2014-02-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 10242 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
NASA Technical Reports Server (NTRS)
Stehle, Roy H.; Ogier, Richard G.
1993-01-01
Alternatives for realizing a packet-based network switch for use on a frequency division multiple access/time division multiplexed (FDMA/TDM) geostationary communication satellite were investigated. Each of the eight downlink beams supports eight directed dwells. The design needed to accommodate multicast packets with very low probability of loss due to contention. Three switch architectures were designed and analyzed. An output-queued, shared bus system yielded a functionally simple system, utilizing a first-in, first-out (FIFO) memory per downlink dwell, but at the expense of a large total memory requirement. A shared memory architecture offered the most efficiency in memory requirements, requiring about half the memory of the shared bus design. The processing requirement for the shared-memory system adds system complexity that may offset the benefits of the smaller memory. An alternative design using a shared memory buffer per downlink beam decreases circuit complexity through a distributed design, and requires at most 1000 packets of memory more than the completely shared memory design. Modifications to the basic packet switch designs were proposed to accommodate circuit-switched traffic, which must be served on a periodic basis with minimal delay. Methods for dynamically controlling the downlink dwell lengths were developed and analyzed. These methods adapt quickly to changing traffic demands, and do not add significant complexity or cost to the satellite and ground station designs. Methods for reducing the memory requirement by not requiring the satellite to store full packets were also proposed and analyzed. In addition, optimal packet and dwell lengths were computed as functions of memory size for the three switch architectures.
A review of emerging non-volatile memory (NVM) technologies and applications
NASA Astrophysics Data System (ADS)
Chen, An
2016-11-01
This paper will review emerging non-volatile memory (NVM) technologies, with the focus on phase change memory (PCM), spin-transfer-torque random-access-memory (STTRAM), resistive random-access-memory (RRAM), and ferroelectric field-effect-transistor (FeFET) memory. These promising NVM devices are evaluated in terms of their advantages, challenges, and applications. Their performance is compared based on reported parameters of major industrial test chips. Memory selector devices and cell structures are discussed. Changing market trends toward low power (e.g., mobile, IoT) and data-centric applications create opportunities for emerging NVMs. High-performance and low-cost emerging NVMs may simplify memory hierarchy, introduce non-volatility in logic gates and circuits, reduce system power, and enable novel architectures. Storage-class memory (SCM) based on high-density NVMs could fill the performance and density gap between memory and storage. Some unique characteristics of emerging NVMs can be utilized for novel applications beyond the memory space, e.g., neuromorphic computing, hardware security, etc. In the beyond-CMOS era, emerging NVMs have the potential to fulfill more important functions and enable more efficient, intelligent, and secure computing systems.
Power and Performance Trade-offs for Space Time Adaptive Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino
Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP’s computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementationmore » on the Haswell CPU architecture. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.« less
Real-time FPGA architectures for computer vision
NASA Astrophysics Data System (ADS)
Arias-Estrada, Miguel; Torres-Huitzil, Cesar
2000-03-01
This paper presents an architecture for real-time generic convolution of a mask and an image. The architecture is intended for fast low level image processing. The FPGA-based architecture takes advantage of the availability of registers in FPGAs to implement an efficient and compact module to process the convolutions. The architecture is designed to minimize the number of accesses to the image memory and is based on parallel modules with internal pipeline operation in order to improve its performance. The architecture is prototyped in a FPGA, but it can be implemented on a dedicated VLSI to reach higher clock frequencies. Complexity issues, FPGA resources utilization, FPGA limitations, and real time performance are discussed. Some results are presented and discussed.
Hadwiger, M; Beyer, J; Jeong, Won-Ki; Pfister, H
2012-12-01
This paper presents the first volume visualization system that scales to petascale volumes imaged as a continuous stream of high-resolution electron microscopy images. Our architecture scales to dense, anisotropic petascale volumes because it: (1) decouples construction of the 3D multi-resolution representation required for visualization from data acquisition, and (2) decouples sample access time during ray-casting from the size of the multi-resolution hierarchy. Our system is designed around a scalable multi-resolution virtual memory architecture that handles missing data naturally, does not pre-compute any 3D multi-resolution representation such as an octree, and can accept a constant stream of 2D image tiles from the microscopes. A novelty of our system design is that it is visualization-driven: we restrict most computations to the visible volume data. Leveraging the virtual memory architecture, missing data are detected during volume ray-casting as cache misses, which are propagated backwards for on-demand out-of-core processing. 3D blocks of volume data are only constructed from 2D microscope image tiles when they have actually been accessed during ray-casting. We extensively evaluate our system design choices with respect to scalability and performance, compare to previous best-of-breed systems, and illustrate the effectiveness of our system for real microscopy data from neuroscience.
Soft errors in commercial off-the-shelf static random access memories
NASA Astrophysics Data System (ADS)
Dilillo, L.; Tsiligiannis, G.; Gupta, V.; Bosser, A.; Saigne, F.; Wrobel, F.
2017-01-01
This article reviews state-of-the-art techniques for the evaluation of the effect of radiation on static random access memory (SRAM). We detailed irradiation test techniques and results from irradiation experiments with several types of particles. Two commercial SRAMs, in 90 and 65 nm technology nodes, were considered as case studies. Besides the basic static and dynamic test modes, advanced stimuli for the irradiation tests were introduced, as well as statistical post-processing techniques allowing for deeper analysis of the correlations between bit-flip cross-sections and design/architectural characteristics of the memory device. Further insight is provided on the response of irradiated stacked layer devices and on the use of characterized SRAM devices as particle detectors.
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics
NASA Technical Reports Server (NTRS)
Bard, Christopher; Dorelli, John C.
2013-01-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approx. = 126 for a sq 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
Improved Air Combat Awareness; with AESA and Next-Generation Signal Processing
2002-09-01
competence network Building techniques Software development environment Communication Computer architecture Modeling Real-time programming Radar...memory access, skewed load and store, 3.2 GB/s BW • Performance: 400 MFLOPS Runtime environment Custom runtime routines Driver routines Hardware
YAPPA: a Compiler-Based Parallelization Framework for Irregular Applications on MPSoCs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lovergine, Silvia; Tumeo, Antonino; Villa, Oreste
Modern embedded systems include hundreds of cores. Because of the difficulty in providing a fast, coherent memory architecture, these systems usually rely on non-coherent, non-uniform memory architectures with private memories for each core. However, programming these systems poses significant challenges. The developer must extract large amounts of parallelism, while orchestrating communication among cores to optimize application performance. These issues become even more significant with irregular applications, which present data sets difficult to partition, unpredictable memory accesses, unbalanced control flow and fine grained communication. Hand-optimizing every single aspect is hard and time-consuming, and it often does not lead to the expectedmore » performance. There is a growing gap between such complex and highly-parallel architectures and the high level languages used to describe the specification, which were designed for simpler systems and do not consider these new issues. In this paper we introduce YAPPA (Yet Another Parallel Programming Approach), a compilation framework for the automatic parallelization of irregular applications on modern MPSoCs based on LLVM. We start by considering an efficient parallel programming approach for irregular applications on distributed memory systems. We then propose a set of transformations that can reduce the development and optimization effort. The results of our initial prototype confirm the correctness of the proposed approach.« less
Multiple core computer processor with globally-accessible local memories
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shalf, John; Donofrio, David; Oliker, Leonid
A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality ofmore » processor cores.« less
An attention-gating recurrent working memory architecture for emergent speech representation
NASA Astrophysics Data System (ADS)
Elshaw, Mark; Moore, Roger K.; Klein, Michael
2010-06-01
This paper describes an attention-gating recurrent self-organising map approach for emergent speech representation. Inspired by evidence from human cognitive processing, the architecture combines two main neural components. The first component, the attention-gating mechanism, uses actor-critic learning to perform selective attention towards speech. Through this selective attention approach, the attention-gating mechanism controls access to working memory processing. The second component, the recurrent self-organising map memory, develops a temporal-distributed representation of speech using phone-like structures. Representing speech in terms of phonetic features in an emergent self-organised fashion, according to research on child cognitive development, recreates the approach found in infants. Using this representational approach, in a fashion similar to infants, should improve the performance of automatic recognition systems through aiding speech segmentation and fast word learning.
Fast associative memory + slow neural circuitry = the computational model of the brain.
NASA Astrophysics Data System (ADS)
Berkovich, Simon; Berkovich, Efraim; Lapir, Gennady
1997-08-01
We propose a computational model of the brain based on a fast associative memory and relatively slow neural processors. In this model, processing time is expensive but memory access is not, and therefore most algorithmic tasks would be accomplished by using large look-up tables as opposed to calculating. The essential feature of an associative memory in this context (characteristic for a holographic type memory) is that it works without an explicit mechanism for resolution of multiple responses. As a result, the slow neuronal processing elements, overwhelmed by the flow of information, operate as a set of templates for ranking of the retrieved information. This structure addresses the primary controversy in the brain architecture: distributed organization of memory vs. localization of processing centers. This computational model offers an intriguing explanation of many of the paradoxical features in the brain architecture, such as integration of sensors (through DMA mechanism), subliminal perception, universality of software, interrupts, fault-tolerance, certain bizarre possibilities for rapid arithmetics etc. In conventional computer science the presented type of a computational model did not attract attention as it goes against the technological grain by using a working memory faster than processing elements.
Designing Next Generation Massively Multithreaded Architectures for Irregular Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tumeo, Antonino; Secchi, Simone; Villa, Oreste
Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multi-threaded architectures with large node count, like the Cray XMT, have been shown to address their requirements better than commodity clusters. In this paper we present the approaches that we are currently pursuing to design future generations of these architectures. First, we introduce the Cray XMT and compare it to other multithreaded architectures. We then propose an evolution of the architecture, integrating multiple cores per node and next generation network interconnect. We advocate the use of hardware support for remote memory referencemore » aggregation to optimize network utilization. For this evaluation we developed a highly parallel, custom simulation infrastructure for multi-threaded systems. Our simulator executes unmodified XMT binaries with very large datasets, capturing effects due to contention and hot-spotting, while predicting execution times with greater than 90% accuracy. We also discuss the FPGA prototyping approach that we are employing to study efficient support for irregular applications in next generation manycore processors.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Tumeo, Antonino; Ferrandi, Fabrizio
Emerging applications such as data mining, bioinformatics, knowledge discovery, social network analysis are irregular. They use data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructures grids, which generates unpredictable memory accesses. These data structures usually are large, but difficult to partition. These applications mostly are memory bandwidth bounded and have high synchronization intensity. However, they also have large amounts of inherent dynamic parallelism, because they potentially perform a task for each one of the element they are exploring. Several efforts are looking at accelerating these applications on hybrid architectures, which integrate general purpose processorsmore » with reconfigurable devices. Some solutions, which demonstrated significant speedups, include custom-hand tuned accelerators or even full processor architectures on the reconfigurable logic. In this paper we present an approach for the automatic synthesis of accelerators from C, targeted at irregular applications. In contrast to typical High Level Synthesis paradigms, which construct a centralized Finite State Machine, our approach generates dynamically scheduled hardware components. While parallelism exploitation in typical HLS-generated accelerators is usually bound within a single execution flow, our solution allows concurrently running multiple execution flow, thus also exploiting the coarser grain task parallelism of irregular applications. Our approach supports multiple, multi-ported and distributed memories, and atomic memory operations. Its main objective is parallelizing as many memory operations as possible, independently from their execution time, to maximize the memory bandwidth utilization. This significantly differs from current HLS flows, which usually consider a single memory port and require precise scheduling of memory operations. A key innovation of our approach is the generation of a memory interface controller, which dynamically maps concurrent memory accesses to multiple ports. We present a case study on a typical irregular kernel, Graph Breadth First search (BFS), exploring different tradeoffs in terms of parallelism and number of memories.« less
Evaluating architecture impact on system energy efficiency
Yu, Shijie; Wang, Rui; Luan, Zhongzhi; Qian, Depei
2017-01-01
As the energy consumption has been surging in an unsustainable way, it is important to understand the impact of existing architecture designs from energy efficiency perspective, which is especially valuable for High Performance Computing (HPC) and datacenter environment hosting tens of thousands of servers. One obstacle hindering the advance of comprehensive evaluation on energy efficiency is the deficient power measuring approach. Most of the energy study relies on either external power meters or power models, both of these two methods contain intrinsic drawbacks in their practical adoption and measuring accuracy. Fortunately, the advent of Intel Running Average Power Limit (RAPL) interfaces has promoted the power measurement ability into next level, with higher accuracy and finer time resolution. Therefore, we argue it is the exact time to conduct an in-depth evaluation of the existing architecture designs to understand their impact on system energy efficiency. In this paper, we leverage representative benchmark suites including serial and parallel workloads from diverse domains to evaluate the architecture features such as Non Uniform Memory Access (NUMA), Simultaneous Multithreading (SMT) and Turbo Boost. The energy is tracked at subcomponent level such as Central Processing Unit (CPU) cores, uncore components and Dynamic Random-Access Memory (DRAM) through exploiting the power measurement ability exposed by RAPL. The experiments reveal non-intuitive results: 1) the mismatch between local compute and remote memory node caused by NUMA effect not only generates dramatic power and energy surge but also deteriorates the energy efficiency significantly; 2) for multithreaded application such as the Princeton Application Repository for Shared-Memory Computers (PARSEC), most of the workloads benefit a notable increase of energy efficiency using SMT, with more than 40% decline in average power consumption; 3) Turbo Boost is effective to accelerate the workload execution and further preserve the energy, however it may not be applicable on system with tight power budget. PMID:29161317
Evaluating architecture impact on system energy efficiency.
Yu, Shijie; Yang, Hailong; Wang, Rui; Luan, Zhongzhi; Qian, Depei
2017-01-01
As the energy consumption has been surging in an unsustainable way, it is important to understand the impact of existing architecture designs from energy efficiency perspective, which is especially valuable for High Performance Computing (HPC) and datacenter environment hosting tens of thousands of servers. One obstacle hindering the advance of comprehensive evaluation on energy efficiency is the deficient power measuring approach. Most of the energy study relies on either external power meters or power models, both of these two methods contain intrinsic drawbacks in their practical adoption and measuring accuracy. Fortunately, the advent of Intel Running Average Power Limit (RAPL) interfaces has promoted the power measurement ability into next level, with higher accuracy and finer time resolution. Therefore, we argue it is the exact time to conduct an in-depth evaluation of the existing architecture designs to understand their impact on system energy efficiency. In this paper, we leverage representative benchmark suites including serial and parallel workloads from diverse domains to evaluate the architecture features such as Non Uniform Memory Access (NUMA), Simultaneous Multithreading (SMT) and Turbo Boost. The energy is tracked at subcomponent level such as Central Processing Unit (CPU) cores, uncore components and Dynamic Random-Access Memory (DRAM) through exploiting the power measurement ability exposed by RAPL. The experiments reveal non-intuitive results: 1) the mismatch between local compute and remote memory node caused by NUMA effect not only generates dramatic power and energy surge but also deteriorates the energy efficiency significantly; 2) for multithreaded application such as the Princeton Application Repository for Shared-Memory Computers (PARSEC), most of the workloads benefit a notable increase of energy efficiency using SMT, with more than 40% decline in average power consumption; 3) Turbo Boost is effective to accelerate the workload execution and further preserve the energy, however it may not be applicable on system with tight power budget.
Interfacing a high performance disk array file server to a Gigabit LAN
NASA Technical Reports Server (NTRS)
Seshan, Srinivasan; Katz, Randy H.
1993-01-01
Our previous prototype, RAID-1, identified several bottlenecks in typical file server architectures. The most important bottleneck was the lack of a high-bandwidth path between disk, memory, and the network. Workstation servers, such as the Sun-4/280, have very slow access to peripherals on busses far from the CPU. For the RAID-2 system, we addressed this problem by designing a crossbar interconnect, Xbus board, that provides a 40MB/s path between disk, memory, and the network interfaces. However, this interconnect does not provide the system CPU with low latency access to control the various interfaces. To provide a high data rate to clients on the network, we were forced to carefully and efficiently design the network software. A block diagram of the system hardware architecture is given. In the following subsections, we describe pieces of the RAID-2 file server hardware that had a significant impact on the design of the network interface.
Parallel Implementation of MAFFT on CUDA-Enabled Graphics Hardware.
Zhu, Xiangyuan; Li, Kenli; Salah, Ahmad; Shi, Lin; Li, Keqin
2015-01-01
Multiple sequence alignment (MSA) constitutes an extremely powerful tool for many biological applications including phylogenetic tree estimation, secondary structure prediction, and critical residue identification. However, aligning large biological sequences with popular tools such as MAFFT requires long runtimes on sequential architectures. Due to the ever increasing sizes of sequence databases, there is increasing demand to accelerate this task. In this paper, we demonstrate how graphic processing units (GPUs), powered by the compute unified device architecture (CUDA), can be used as an efficient computational platform to accelerate the MAFFT algorithm. To fully exploit the GPU's capabilities for accelerating MAFFT, we have optimized the sequence data organization to eliminate the bandwidth bottleneck of memory access, designed a memory allocation and reuse strategy to make full use of limited memory of GPUs, proposed a new modified-run-length encoding (MRLE) scheme to reduce memory consumption, and used high-performance shared memory to speed up I/O operations. Our implementation tested in three NVIDIA GPUs achieves speedup up to 11.28 on a Tesla K20m GPU compared to the sequential MAFFT 7.015.
Programming distributed memory architectures using Kali
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Vanrosendale, John
1990-01-01
Programming nonshared memory systems is more difficult than programming shared memory systems, in part because of the relatively low level of current programming environments for such machines. A new programming environment is presented, Kali, which provides a global name space and allows direct access to remote data values. In order to retain efficiency, Kali provides a system on annotations, allowing the user to control those aspects of the program critical to performance, such as data distribution and load balancing. The primitives and constructs provided by the language is described, and some of the issues raised in translating a Kali program for execution on distributed memory systems are also discussed.
Real-time field programmable gate array architecture for computer vision
NASA Astrophysics Data System (ADS)
Arias-Estrada, Miguel; Torres-Huitzil, Cesar
2001-01-01
This paper presents an architecture for real-time generic convolution of a mask and an image. The architecture is intended for fast low-level image processing. The field programmable gate array (FPGA)-based architecture takes advantage of the availability of registers in FPGAs to implement an efficient and compact module to process the convolutions. The architecture is designed to minimize the number of accesses to the image memory and it is based on parallel modules with internal pipeline operation in order to improve its performance. The architecture is prototyped in a FPGA, but it can be implemented on dedicated very- large-scale-integrated devices to reach higher clock frequencies. Complexity issues, FPGA resources utilization, FPGA limitations, and real-time performance are discussed. Some results are presented and discussed.
Instruction-level performance modeling and characterization of multimedia applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Luo, Y.; Cameron, K.W.
1999-06-01
One of the challenges for characterizing and modeling realistic multimedia applications is the lack of access to source codes. On-chip performance counters effectively resolve this problem by monitoring run-time behaviors at the instruction-level. This paper presents a novel technique of characterizing and modeling workloads at the instruction level for realistic multimedia applications using hardware performance counters. A variety of instruction counts are collected from some multimedia applications, such as RealPlayer, GSM Vocoder, MPEG encoder/decoder, and speech synthesizer. These instruction counts can be used to form a set of abstract characteristic parameters directly related to a processor`s architectural features. Based onmore » microprocessor architectural constraints and these calculated abstract parameters, the architectural performance bottleneck for a specific application can be estimated. Meanwhile, the bottleneck estimation can provide suggestions about viable architectural/functional improvement for certain workloads. The biggest advantage of this new characterization technique is a better understanding of processor utilization efficiency and architectural bottleneck for each application. This technique also provides predictive insight of future architectural enhancements and their affect on current codes. In this paper the authors also attempt to model architectural effect on processor utilization without memory influence. They derive formulas for calculating CPI{sub 0}, CPI without memory effect, and they quantify utilization of architectural parameters. These equations are architecturally diagnostic and predictive in nature. Results provide promise in code characterization, and empirical/analytical modeling.« less
NASA Astrophysics Data System (ADS)
Chang, Che-Chia; Liu, Po-Tsun; Chien, Chen-Yu; Fan, Yang-Shun
2018-04-01
This study demonstrates the integration of a thin film transistor (TFT) and resistive random-access memory (RRAM) to form a one-transistor-one-resistor (1T1R) configuration. With the concept of the current conducting direction in RRAM and TFT, a triple-layer stack design of Pt/InGaZnO/Al2O3 is proposed for both the switching layer of RRAM and the channel layer of TFT. This proposal decreases the complexity of fabrication and the numbers of photomasks required. Also, the robust endurance and stable retention characteristics are exhibited by the 1T1R architecture for promising applications in memory-embedded flat panel displays.
NASA Astrophysics Data System (ADS)
Cortese, Simone; Khiat, Ali; Carta, Daniela; Light, Mark E.; Prodromakis, Themistoklis
2016-01-01
Resistive random access memory (ReRAM) crossbar arrays have become one of the most promising candidates for next-generation non volatile memories. To become a mature technology, the sneak path current issue must be solved without compromising all the advantages that crossbars offer in terms of electrical performances and fabrication complexity. Here, we present a highly integrable access device based on nickel and sub-stoichiometric amorphous titanium dioxide (TiO2-x), in a metal insulator metal crossbar structure. The high voltage margin of 3 V, amongst the highest reported for monolayer selector devices, and the good current density of 104 A/cm2 make it suitable to sustain ReRAM read and write operations, effectively tackling sneak currents in crossbars without compromising fabrication complexity in a 1 Selector 1 Resistor (1S1R) architecture. Furthermore, the voltage margin is found to be tunable by an annealing step without affecting the device's characteristics.
Parallelization of Program to Optimize Simulated Trajectories (POST3D)
NASA Technical Reports Server (NTRS)
Hammond, Dana P.; Korte, John J. (Technical Monitor)
2001-01-01
This paper describes the parallelization of the Program to Optimize Simulated Trajectories (POST3D). POST3D uses a gradient-based optimization algorithm that reaches an optimum design point by moving from one design point to the next. The gradient calculations required to complete the optimization process, dominate the computational time and have been parallelized using a Single Program Multiple Data (SPMD) on a distributed memory NUMA (non-uniform memory access) architecture. The Origin2000 was used for the tests presented.
Hardware architecture design of a fast global motion estimation method
NASA Astrophysics Data System (ADS)
Liang, Chaobing; Sang, Hongshi; Shen, Xubang
2015-12-01
VLSI implementation of gradient-based global motion estimation (GME) faces two main challenges: irregular data access and high off-chip memory bandwidth requirement. We previously proposed a fast GME method that reduces computational complexity by choosing certain number of small patches containing corners and using them in a gradient-based framework. A hardware architecture is designed to implement this method and further reduce off-chip memory bandwidth requirement. On-chip memories are used to store coordinates of the corners and template patches, while the Gaussian pyramids of both the template and reference frame are stored in off-chip SDRAMs. By performing geometric transform only on the coordinates of the center pixel of a 3-by-3 patch in the template image, a 5-by-5 area containing the warped 3-by-3 patch in the reference image is extracted from the SDRAMs by burst read. Patched-based and burst mode data access helps to keep the off-chip memory bandwidth requirement at the minimum. Although patch size varies at different pyramid level, all patches are processed in term of 3x3 patches, so the utilization of the patch-processing circuit reaches 100%. FPGA implementation results show that the design utilizes 24,080 bits on-chip memory and for a sequence with resolution of 352x288 and frequency of 60Hz, the off-chip bandwidth requirement is only 3.96Mbyte/s, compared with 243.84Mbyte/s of the original gradient-based GME method. This design can be used in applications like video codec, video stabilization, and super-resolution, where real-time GME is a necessity and minimum memory bandwidth requirement is appreciated.
Parallel processing approach to transform-based image coding
NASA Astrophysics Data System (ADS)
Normile, James O.; Wright, Dan; Chu, Ken; Yeh, Chia L.
1991-06-01
This paper describes a flexible parallel processing architecture designed for use in real time video processing. The system consists of floating point DSP processors connected to each other via fast serial links, each processor has access to a globally shared memory. A multiple bus architecture in combination with a dual ported memory allows communication with a host control processor. The system has been applied to prototyping of video compression and decompression algorithms. The decomposition of transform based algorithms for decompression into a form suitable for parallel processing is described. A technique for automatic load balancing among the processors is developed and discussed, results ar presented with image statistics and data rates. Finally techniques for accelerating the system throughput are analyzed and results from the application of one such modification described.
On VLSI Design of Rank-Order Filtering using DCRAM Architecture
Lin, Meng-Chun; Dung, Lan-Rong
2009-01-01
This paper addresses on VLSI design of rank-order filtering (ROF) with a maskable memory for real-time speech and image processing applications. Based on a generic bit-sliced ROF algorithm, the proposed design uses a special-defined memory, called the dual-cell random-access memory (DCRAM), to realize major operations of ROF: threshold decomposition and polarization. Using the memory-oriented architecture, the proposed ROF processor can benefit from high flexibility, low cost and high speed. The DCRAM can perform the bit-sliced read, partial write, and pipelined processing. The bit-sliced read and partial write are driven by maskable registers. With recursive execution of the bit-slicing read and partial write, the DCRAM can effectively realize ROF in terms of cost and speed. The proposed design has been implemented using TSMC 0.18 μm 1P6M technology. As shown in the result of physical implementation, the core size is 356.1 × 427.7μm2 and the VLSI implementation of ROF can operate at 256 MHz for 1.8V supply. PMID:19865599
NASA Astrophysics Data System (ADS)
Choi, Shinhyun; Tan, Scott H.; Li, Zefan; Kim, Yunjo; Choi, Chanyeol; Chen, Pai-Yu; Yeon, Hanwool; Yu, Shimeng; Kim, Jeehwan
2018-01-01
Although several types of architecture combining memory cells and transistors have been used to demonstrate artificial synaptic arrays, they usually present limited scalability and high power consumption. Transistor-free analog switching devices may overcome these limitations, yet the typical switching process they rely on—formation of filaments in an amorphous medium—is not easily controlled and hence hampers the spatial and temporal reproducibility of the performance. Here, we demonstrate analog resistive switching devices that possess desired characteristics for neuromorphic computing networks with minimal performance variations using a single-crystalline SiGe layer epitaxially grown on Si as a switching medium. Such epitaxial random access memories utilize threading dislocations in SiGe to confine metal filaments in a defined, one-dimensional channel. This confinement results in drastically enhanced switching uniformity and long retention/high endurance with a high analog on/off ratio. Simulations using the MNIST handwritten recognition data set prove that epitaxial random access memories can operate with an online learning accuracy of 95.1%.
A Low Cost Microcomputer Laboratory for Investigating Computer Architecture.
ERIC Educational Resources Information Center
Mitchell, Eugene E., Ed.
1980-01-01
Described is a microcomputer laboratory at the United States Military Academy at West Point, New York, which provides easy access to non-volatile memory and a single input/output file system for 16 microcomputer laboratory positions. A microcomputer network that has a centralized data base is implemented using the concepts of computer network…
NASA Astrophysics Data System (ADS)
Zellmann, Stefan; Percan, Yvonne; Lang, Ulrich
2015-01-01
Reconstruction of 2-d image primitives or of 3-d volumetric primitives is one of the most common operations performed by the rendering components of modern visualization systems. Because this operation is often aided by GPUs, reconstruction is typically restricted to first-order interpolation. With the advent of in situ visualization, the assumption that rendering algorithms are in general executed on GPUs is however no longer adequate. We thus propose a framework that provides versatile texture filtering capabilities: up to third-order reconstruction using various types of cubic filtering and interpolation primitives; cache-optimized algorithms that integrate seamlessly with GPGPU rendering or with software rendering that was optimized for cache-friendly "Structure of Array" (SoA) access patterns; a memory management layer (MML) that gracefully hides the complexities of extra data copies necessary for memory access optimizations such as swizzling, for rendering on GPGPUs, or for reconstruction schemes that rely on pre-filtered data arrays. We prove the effectiveness of our software architecture by integrating it into and validating it using the open source direct volume rendering (DVR) software DeskVOX.
Nonvolatile reconfigurable sequential logic in a HfO2 resistive random access memory array.
Zhou, Ya-Xiong; Li, Yi; Su, Yu-Ting; Wang, Zhuo-Rui; Shih, Ling-Yi; Chang, Ting-Chang; Chang, Kuan-Chang; Long, Shi-Bing; Sze, Simon M; Miao, Xiang-Shui
2017-05-25
Resistive random access memory (RRAM) based reconfigurable logic provides a temporal programmable dimension to realize Boolean logic functions and is regarded as a promising route to build non-von Neumann computing architecture. In this work, a reconfigurable operation method is proposed to perform nonvolatile sequential logic in a HfO 2 -based RRAM array. Eight kinds of Boolean logic functions can be implemented within the same hardware fabrics. During the logic computing processes, the RRAM devices in an array are flexibly configured in a bipolar or complementary structure. The validity was demonstrated by experimentally implemented NAND and XOR logic functions and a theoretically designed 1-bit full adder. With the trade-off between temporal and spatial computing complexity, our method makes better use of limited computing resources, thus provides an attractive scheme for the construction of logic-in-memory systems.
Feasibility of self-structured current accessed bubble devices in spacecraft recording systems
NASA Technical Reports Server (NTRS)
Nelson, G. L.; Krahn, D. R.; Dean, R. H.; Paul, M. C.; Lo, D. S.; Amundsen, D. L.; Stein, G. A.
1985-01-01
The self-structured, current aperture approach to magnetic bubble memory is described. Key results include: (1) demonstration that self-structured bubbles (a lattice of strongly interacting bubbles) will slip by one another in a storage loop at spacings of 2.5 bubble diameters, (2) the ability of self-structured bubbles to move past international fabrication defects (missing apertures) in the propagation conductors (defeat tolerance), and (3) moving bubbles at mobility limited speeds. Milled barriers in the epitaxial garnet are discussed for containment of the bubble lattice. Experimental work on input/output tracks, storage loops, gates, generators, and magneto-resistive detectors for a prototype device are discussed. Potential final device architectures are described with modeling of power consumption, data rates, and access times. Appendices compare the self-structured bubble memory from the device and system perspectives with other non-volatile memory technologies.
BIRD: A general interface for sparse distributed memory simulators
NASA Technical Reports Server (NTRS)
Rogers, David
1990-01-01
Kanerva's sparse distributed memory (SDM) has now been implemented for at least six different computers, including SUN3 workstations, the Apple Macintosh, and the Connection Machine. A common interface for input of commands would both aid testing of programs on a broad range of computer architectures and assist users in transferring results from research environments to applications. A common interface also allows secondary programs to generate command sequences for a sparse distributed memory, which may then be executed on the appropriate hardware. The BIRD program is an attempt to create such an interface. Simplifying access to different simulators should assist developers in finding appropriate uses for SDM.
FPGA cluster for high-performance AO real-time control system
NASA Astrophysics Data System (ADS)
Geng, Deli; Goodsell, Stephen J.; Basden, Alastair G.; Dipper, Nigel A.; Myers, Richard M.; Saunter, Chris D.
2006-06-01
Whilst the high throughput and low latency requirements for the next generation AO real-time control systems have posed a significant challenge to von Neumann architecture processor systems, the Field Programmable Gate Array (FPGA) has emerged as a long term solution with high performance on throughput and excellent predictability on latency. Moreover, FPGA devices have highly capable programmable interfacing, which lead to more highly integrated system. Nevertheless, a single FPGA is still not enough: multiple FPGA devices need to be clustered to perform the required subaperture processing and the reconstruction computation. In an AO real-time control system, the memory bandwidth is often the bottleneck of the system, simply because a vast amount of supporting data, e.g. pixel calibration maps and the reconstruction matrix, need to be accessed within a short period. The cluster, as a general computing architecture, has excellent scalability in processing throughput, memory bandwidth, memory capacity, and communication bandwidth. Problems, such as task distribution, node communication, system verification, are discussed.
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak
1999-01-01
The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Gregarious Data Re-structuring in a Many Core Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shrestha, Sunil; Manzano Franco, Joseph B.; Marquez, Andres
this paper, we have developed a new methodology that takes in consideration the access patterns from a single parallel actor (e.g. a thread), as well as, the access patterns of “grouped” parallel actors that share a resource (e.g. a distributed Level 3 cache). We start with a hierarchical tile code for our target machine and apply a series of transformations at the tile level to improve data residence in a given memory hierarchy level. The contribution of this paper includes (a) collaborative data restructuring for group reuse and (b) low overhead transformation technique to improve access pattern and bring closelymore » connected data elements together. Preliminary results in a many core architecture, Tilera TileGX, shows promising improvements over optimized OpenMP code (up to 31% increase in GFLOPS) and over our own previous work on fine grained runtimes (up to 16%) for selected kernels« less
A polymer/semiconductor write-once read-many-times memory
NASA Astrophysics Data System (ADS)
Möller, Sven; Perlov, Craig; Jackson, Warren; Taussig, Carl; Forrest, Stephen R.
2003-11-01
Organic devices promise to revolutionize the extent of, and access to, electronics by providing extremely inexpensive, lightweight and capable ubiquitous components that are printed onto plastic, glass or metal foils. One key component of an electronic circuit that has thus far received surprisingly little attention is an organic electronic memory. Here we report an architecture for a write-once read-many-times (WORM) memory, based on the hybrid integration of an electrochromic polymer with a thin-film silicon diode deposited onto a flexible metal foil substrate. WORM memories are desirable for ultralow-cost permanent storage of digital images, eliminating the need for slow, bulky and expensive mechanical drives used in conventional magnetic and optical memories. Our results indicate that the hybrid organic/inorganic memory device is a reliable means for achieving rapid, large-scale archival data storage. The WORM memory pixel exploits a mechanism of current-controlled, thermally activated un-doping of a two-component electrochromic conducting polymer.
Virtual memory support for distributed computing environments using a shared data object model
NASA Astrophysics Data System (ADS)
Huang, F.; Bacon, J.; Mapp, G.
1995-12-01
Conventional storage management systems provide one interface for accessing memory segments and another for accessing secondary storage objects. This hinders application programming and affects overall system performance due to mandatory data copying and user/kernel boundary crossings, which in the microkernel case may involve context switches. Memory-mapping techniques may be used to provide programmers with a unified view of the storage system. This paper extends such techniques to support a shared data object model for distributed computing environments in which good support for coherence and synchronization is essential. The approach is based on a microkernel, typed memory objects, and integrated coherence control. A microkernel architecture is used to support multiple coherence protocols and the addition of new protocols. Memory objects are typed and applications can choose the most suitable protocols for different types of object to avoid protocol mismatch. Low-level coherence control is integrated with high-level concurrency control so that the number of messages required to maintain memory coherence is reduced and system-wide synchronization is realized without severely impacting the system performance. These features together contribute a novel approach to the support for flexible coherence under application control.
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Labarta, Jesus; Gimenez, Judit
2004-01-01
With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved that support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation specific behavior of the operating system (OS) or architectural issues. Rewriting-a large scientific application in order to employ a new programming paradigms is usually a time consuming and error prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse grained parallelization and OpenMP [9] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP which was developed by Taft. The approach is similar to MPi/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As it is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gebis, Joseph; Oliker, Leonid; Shalf, John
The disparity between microprocessor clock frequencies and memory latency is a primary reason why many demanding applications run well below peak achievable performance. Software controlled scratchpad memories, such as the Cell local store, attempt to ameliorate this discrepancy by enabling precise control over memory movement; however, scratchpad technology confronts the programmer and compiler with an unfamiliar and difficult programming model. In this work, we present the Virtual Vector Architecture (ViVA), which combines the memory semantics of vector computers with a software-controlled scratchpad memory in order to provide a more effective and practical approach to latency hiding. ViVA requires minimal changesmore » to the core design and could thus be easily integrated with conventional processor cores. To validate our approach, we implemented ViVA on the Mambo cycle-accurate full system simulator, which was carefully calibrated to match the performance on our underlying PowerPC Apple G5 architecture. Results show that ViVA is able to deliver significant performance benefits over scalar techniques for a variety of memory access patterns as well as two important memory-bound compact kernels, corner turn and sparse matrix-vector multiplication -- achieving 2x-13x improvement compared the scalar version. Overall, our preliminary ViVA exploration points to a promising approach for improving application performance on leading microprocessors with minimal design and complexity costs, in a power efficient manner.« less
NASA Astrophysics Data System (ADS)
He, Huimin; Liu, Fengman; Li, Baoxia; Xue, Haiyun; Wang, Haidong; Qiu, Delong; Zhou, Yunyan; Cao, Liqiang
2016-11-01
With the development of the multicore processor, the bandwidth and capacity of the memory, rather than the memory area, are the key factors in server performance. At present, however, the new architectures, such as fully buffered DIMM (FBDIMM), hybrid memory cube (HMC), and high bandwidth memory (HBM), cannot be commercially applied in the server. Therefore, a new architecture for the server is proposed. CPU and memory are separated onto different boards, and optical interconnection is used for the communication between them. Each optical module corresponds to each dual inline memory module (DIMM) with 64 channels. Compared to the previous technology, not only can the architecture realize high-capacity and wide-bandwidth memory, it also can reduce power consumption and cost, and be compatible with the existing dynamic random access memory (DRAM). In this article, the proposed module with system-in-package (SiP) integration is demonstrated. In the optical module, the silicon photonic chip is included, which is a promising technology to be applied in the next-generation data exchanging centers. And due to the bandwidth-distance performance of the optical interconnection, SerDes chips are introduced to convert the 64-bit data at 800 Mbps from/to 4-channel data at 12.8 Gbps after/before they are transmitted though optical fiber. All the devices are packaged on cheap organic substrates. To ensure the performance of the whole system, several optimization efforts have been performed on the two modules. High-speed interconnection traces have been designed and simulated with electromagnetic simulation software. Steady-state thermal characteristics of the transceiver module have been evaluated by ANSYS APLD based on finite-element methodology (FEM). Heat sinks are placed at the hotspot area to ensure the reliability of all working chips. Finally, this transceiver system based on silicon photonics is measured, and the eye diagrams of data and clock signals are verified.
Dynamic Photorefractive Memory and its Application for Opto-Electronic Neural Networks.
NASA Astrophysics Data System (ADS)
Sasaki, Hironori
This dissertation describes the analysis of the photorefractive crystal dynamics and its application for opto-electronic neural network systems. The realization of the dynamic photorefractive memory is investigated in terms of the following aspects: fast memory update, uniform grating multiplexing schedules and the prevention of the partial erasure of existing gratings. The fast memory update is realized by the selective erasure process that superimposes a new grating on the original one with an appropriate phase shift. The dynamics of the selective erasure process is analyzed using the first-order photorefractive material equations and experimentally confirmed. The effects of beam coupling and fringe bending on the selective erasure dynamics are also analyzed by numerically solving a combination of coupled wave equations and the photorefractive material equation. Incremental recording technique is proposed as a uniform grating multiplexing schedule and compared with the conventional scheduled recording technique in terms of phase distribution in the presence of an external dc electric field, as well as the image gray scale dependence. The theoretical analysis and experimental results proved the superiority of the incremental recording technique over the scheduled recording. Novel recirculating information memory architecture is proposed and experimentally demonstrated to prevent partial degradation of the existing gratings by accessing the memory. Gratings are circulated through a memory feed back loop based on the incremental recording dynamics and demonstrate robust read/write/erase capabilities. The dynamic photorefractive memory is applied to opto-electronic neural network systems. Module architecture based on the page-oriented dynamic photorefractive memory is proposed. This module architecture can implement two complementary interconnection organizations, fan-in and fan-out. The module system scalability and the learning capabilities are theoretically investigated using the photorefractive dynamics described in previous chapters of the dissertation. The implementation of the feed-forward image compression network with 900 input and 9 output neurons with 6-bit interconnection accuracy is experimentally demonstrated. Learning of the Perceptron network that determines sex based on input face images of 900 pixels is also successfully demonstrated.
Improved cache performance in Monte Carlo transport calculations using energy banding
NASA Astrophysics Data System (ADS)
Siegel, A.; Smith, K.; Felker, K.; Romano, P.; Forget, B.; Beckman, P.
2014-04-01
We present an energy banding algorithm for Monte Carlo (MC) neutral particle transport simulations which depend on large cross section lookup tables. In MC codes, read-only cross section data tables are accessed frequently, exhibit poor locality, and are typically too much large to fit in fast memory. Thus, performance is often limited by long latencies to RAM, or by off-node communication latencies when the data footprint is very large and must be decomposed on a distributed memory machine. The proposed energy banding algorithm allows maximal temporal reuse of data in band sizes that can flexibly accommodate different architectural features. The energy banding algorithm is general and has a number of benefits compared to the traditional approach. In the present analysis we explore its potential to achieve improvements in time-to-solution on modern cache-based architectures.
A highly efficient 3D level-set grain growth algorithm tailored for ccNUMA architecture
NASA Astrophysics Data System (ADS)
Mießen, C.; Velinov, N.; Gottstein, G.; Barrales-Mora, L. A.
2017-12-01
A highly efficient simulation model for 2D and 3D grain growth was developed based on the level-set method. The model introduces modern computational concepts to achieve excellent performance on parallel computer architectures. Strong scalability was measured on cache-coherent non-uniform memory access (ccNUMA) architectures. To achieve this, the proposed approach considers the application of local level-set functions at the grain level. Ideal and non-ideal grain growth was simulated in 3D with the objective to study the evolution of statistical representative volume elements in polycrystals. In addition, microstructure evolution in an anisotropic magnetic material affected by an external magnetic field was simulated.
Computer vision camera with embedded FPGA processing
NASA Astrophysics Data System (ADS)
Lecerf, Antoine; Ouellet, Denis; Arias-Estrada, Miguel
2000-03-01
Traditional computer vision is based on a camera-computer system in which the image understanding algorithms are embedded in the computer. To circumvent the computational load of vision algorithms, low-level processing and imaging hardware can be integrated in a single compact module where a dedicated architecture is implemented. This paper presents a Computer Vision Camera based on an open architecture implemented in an FPGA. The system is targeted to real-time computer vision tasks where low level processing and feature extraction tasks can be implemented in the FPGA device. The camera integrates a CMOS image sensor, an FPGA device, two memory banks, and an embedded PC for communication and control tasks. The FPGA device is a medium size one equivalent to 25,000 logic gates. The device is connected to two high speed memory banks, an IS interface, and an imager interface. The camera can be accessed for architecture programming, data transfer, and control through an Ethernet link from a remote computer. A hardware architecture can be defined in a Hardware Description Language (like VHDL), simulated and synthesized into digital structures that can be programmed into the FPGA and tested on the camera. The architecture of a classical multi-scale edge detection algorithm based on a Laplacian of Gaussian convolution has been developed to show the capabilities of the system.
FPGA-Based, Self-Checking, Fault-Tolerant Computers
NASA Technical Reports Server (NTRS)
Some, Raphael; Rennels, David
2004-01-01
A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and highspeed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant- computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures
2017-10-04
Report: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures The views, opinions and/or findings contained in this...Chapel Hill Title: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures Report Term: 0-Other Email: dm...algorithms for scientific and geometric computing by exploiting the power and performance efficiency of heterogeneous shared memory architectures . These
DOE Office of Scientific and Technical Information (OSTI.GOV)
Panyala, Ajay; Chavarría-Miranda, Daniel; Manzano, Joseph B.
High performance, parallel applications with irregular data accesses are becoming a critical workload class for modern systems. In particular, the execution of such workloads on emerging many-core systems is expected to be a significant component of applications in data mining, machine learning, scientific computing and graph analytics. However, power and energy constraints limit the capabilities of individual cores, memory hierarchy and on-chip interconnect of such systems, thus leading to architectural and software trade-os that must be understood in the context of the intended application’s behavior. Irregular applications are notoriously hard to optimize given their data-dependent access patterns, lack of structuredmore » locality and complex data structures and code patterns. We have ported two irregular applications, graph community detection using the Louvain method (Grappolo) and high-performance conjugate gradient (HPCCG), to the Tilera many-core system and have conducted a detailed study of platform-independent and platform-specific optimizations that improve their performance as well as reduce their overall energy consumption. To conduct this study, we employ an auto-tuning based approach that explores the optimization design space along three dimensions - memory layout schemes, GCC compiler flag choices and OpenMP loop scheduling options. We leverage MIT’s OpenTuner auto-tuning framework to explore and recommend energy optimal choices for different combinations of parameters. We then conduct an in-depth architectural characterization to understand the memory behavior of the selected workloads. Finally, we perform a correlation study to demonstrate the interplay between the hardware behavior and application characteristics. Using auto-tuning, we demonstrate whole-node energy savings and performance improvements of up to 49:6% and 60% relative to a baseline instantiation, and up to 31% and 45:4% relative to manually optimized variants.« less
Metal oxide resistive random access memory based synaptic devices for brain-inspired computing
NASA Astrophysics Data System (ADS)
Gao, Bin; Kang, Jinfeng; Zhou, Zheng; Chen, Zhe; Huang, Peng; Liu, Lifeng; Liu, Xiaoyan
2016-04-01
The traditional Boolean computing paradigm based on the von Neumann architecture is facing great challenges for future information technology applications such as big data, the Internet of Things (IoT), and wearable devices, due to the limited processing capability issues such as binary data storage and computing, non-parallel data processing, and the buses requirement between memory units and logic units. The brain-inspired neuromorphic computing paradigm is believed to be one of the promising solutions for realizing more complex functions with a lower cost. To perform such brain-inspired computing with a low cost and low power consumption, novel devices for use as electronic synapses are needed. Metal oxide resistive random access memory (ReRAM) devices have emerged as the leading candidate for electronic synapses. This paper comprehensively addresses the recent work on the design and optimization of metal oxide ReRAM-based synaptic devices. A performance enhancement methodology and optimized operation scheme to achieve analog resistive switching and low-energy training behavior are provided. A three-dimensional vertical synapse network architecture is proposed for high-density integration and low-cost fabrication. The impacts of the ReRAM synaptic device features on the performances of neuromorphic systems are also discussed on the basis of a constructed neuromorphic visual system with a pattern recognition function. Possible solutions to achieve the high recognition accuracy and efficiency of neuromorphic systems are presented.
PCM-Based Durable Write Cache for Fast Disk I/O
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Zhuo; Wang, Bin; Carpenter, Patrick
2012-01-01
Flash based solid-state devices (FSSDs) have been adopted within the memory hierarchy to improve the performance of hard disk drive (HDD) based storage system. However, with the fast development of storage-class memories, new storage technologies with better performance and higher write endurance than FSSDs are emerging, e.g., phase-change memory (PCM). Understanding how to leverage these state-of-the-art storage technologies for modern computing systems is important to solve challenging data intensive computing problems. In this paper, we propose to leverage PCM for a hybrid PCM-HDD storage architecture. We identify the limitations of traditional LRU caching algorithms for PCM-based caches, and develop amore » novel hash-based write caching scheme called HALO to improve random write performance of hard disks. To address the limited durability of PCM devices and solve the degraded spatial locality in traditional wear-leveling techniques, we further propose novel PCM management algorithms that provide effective wear-leveling while maximizing access parallelism. We have evaluated this PCM-based hybrid storage architecture using applications with a diverse set of I/O access patterns. Our experimental results demonstrate that the HALO caching scheme leads to an average reduction of 36.8% in execution time compared to the LRU caching scheme, and that the SFC wear leveling extends the lifetime of PCM by a factor of 21.6.« less
Description and Simulation of a Fast Packet Switch Architecture for Communication Satellites
NASA Technical Reports Server (NTRS)
Quintana, Jorge A.; Lizanich, Paul J.
1995-01-01
The NASA Lewis Research Center has been developing the architecture for a multichannel communications signal processing satellite (MCSPS) as part of a flexible, low-cost meshed-VSAT (very small aperture terminal) network. The MCSPS architecture is based on a multifrequency, time-division-multiple-access (MF-TDMA) uplink and a time-division multiplex (TDM) downlink. There are eight uplink MF-TDMA beams, and eight downlink TDM beams, with eight downlink dwells per beam. The information-switching processor, which decodes, stores, and transmits each packet of user data to the appropriate downlink dwell onboard the satellite, has been fully described by using VHSIC (Very High Speed Integrated-Circuit) Hardware Description Language (VHDL). This VHDL code, which was developed in-house to simulate the information switching processor, showed that the architecture is both feasible and viable. This paper describes a shared-memory-per-beam architecture, its VHDL implementation, and the simulation efforts.
Efficient checkpointing schemes for depletion perturbation solutions on memory-limited architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stripling, H. F.; Adams, M. L.; Hawkins, W. D.
2013-07-01
We describe a methodology for decreasing the memory footprint and machine I/O load associated with the need to access a forward solution during an adjoint solve. Specifically, we are interested in the depletion perturbation equations, where terms in the adjoint Bateman and transport equations depend on the forward flux solution. Checkpointing is the procedure of storing snapshots of the forward solution to disk and using these snapshots to recompute the parts of the forward solution that are necessary for the adjoint solve. For large problems, however, the storage cost of just a few copies of an angular flux vector canmore » exceed the available RAM on the host machine. We propose a methodology that does not checkpoint the angular flux vector; instead, we write and store converged source moments, which are typically of a much lower dimension than the angular flux solution. This reduces the memory footprint and I/O load of the problem, but requires that we perform single sweeps to reconstruct flux vectors on demand. We argue that this trade-off is exactly the kind of algorithm that will scale on advanced, memory-limited architectures. We analyze the cost, in terms of FLOPS and memory footprint, of five checkpointing schemes. We also provide computational results that support the analysis and show that the memory-for-work trade off does improve time to solution. (authors)« less
Enabling the High Level Synthesis of Data Analytics Accelerators
DOE Office of Scientific and Technical Information (OSTI.GOV)
Minutoli, Marco; Castellana, Vito G.; Tumeo, Antonino
Conventional High Level Synthesis (HLS) tools mainly tar- get compute intensive kernels typical of digital signal pro- cessing applications. We are developing techniques and ar- chitectural templates to enable HLS of data analytics appli- cations. These applications are memory intensive, present fine-grained, unpredictable data accesses, and irregular, dy- namic task parallelism. We discuss an architectural tem- plate based around a distributed controller to efficiently ex- ploit thread level parallelism. We present a memory in- terface that supports parallel memory subsystems and en- ables implementing atomic memory operations. We intro- duce a dynamic task scheduling approach to efficiently ex- ecute heavilymore » unbalanced workload. The templates are val- idated by synthesizing queries from the Lehigh University Benchmark (LUBM), a well know SPARQL benchmark.« less
Scalable quantum memory in the ultrastrong coupling regime.
Kyaw, T H; Felicetti, S; Romero, G; Solano, E; Kwek, L-C
2015-03-02
Circuit quantum electrodynamics, consisting of superconducting artificial atoms coupled to on-chip resonators, represents a prime candidate to implement the scalable quantum computing architecture because of the presence of good tunability and controllability. Furthermore, recent advances have pushed the technology towards the ultrastrong coupling regime of light-matter interaction, where the qubit-resonator coupling strength reaches a considerable fraction of the resonator frequency. Here, we propose a qubit-resonator system operating in that regime, as a quantum memory device and study the storage and retrieval of quantum information in and from the Z2 parity-protected quantum memory, within experimentally feasible schemes. We are also convinced that our proposal might pave a way to realize a scalable quantum random-access memory due to its fast storage and readout performances.
Scalable quantum memory in the ultrastrong coupling regime
Kyaw, T. H.; Felicetti, S.; Romero, G.; Solano, E.; Kwek, L.-C.
2015-01-01
Circuit quantum electrodynamics, consisting of superconducting artificial atoms coupled to on-chip resonators, represents a prime candidate to implement the scalable quantum computing architecture because of the presence of good tunability and controllability. Furthermore, recent advances have pushed the technology towards the ultrastrong coupling regime of light-matter interaction, where the qubit-resonator coupling strength reaches a considerable fraction of the resonator frequency. Here, we propose a qubit-resonator system operating in that regime, as a quantum memory device and study the storage and retrieval of quantum information in and from the Z2 parity-protected quantum memory, within experimentally feasible schemes. We are also convinced that our proposal might pave a way to realize a scalable quantum random-access memory due to its fast storage and readout performances. PMID:25727251
The storage system of PCM based on random access file system
NASA Astrophysics Data System (ADS)
Han, Wenbing; Chen, Xiaogang; Zhou, Mi; Li, Shunfen; Li, Gezi; Song, Zhitang
2016-10-01
Emerging memory technologies such as Phase change memory (PCM) tend to offer fast, random access to persistent storage with better scalability. It's a hot topic of academic and industrial research to establish PCM in storage hierarchy to narrow the performance gap. However, the existing file systems do not perform well with the emerging PCM storage, which access storage medium via a slow, block-based interface. In this paper, we propose a novel file system, RAFS, to bring about good performance of PCM, which is built in the embedded platform. We attach PCM chips to the memory bus and build RAFS on the physical address space. In the proposed file system, we simplify traditional system architecture to eliminate block-related operations and layers. Furthermore, we adopt memory mapping and bypassed page cache to reduce copy overhead between the process address space and storage device. XIP mechanisms are also supported in RAFS. To the best of our knowledge, we are among the first to implement file system on real PCM chips. We have analyzed and evaluated its performance with IOZONE benchmark tools. Our experimental results show that the RAFS on PCM outperforms Ext4fs on SDRAM with small record lengths. Based on DRAM, RAFS is significantly faster than Ext4fs by 18% to 250%.
Scalable Parallel Density-based Clustering and Applications
NASA Astrophysics Data System (ADS)
Patwary, Mostofa Ali
2014-04-01
Recently, density-based clustering algorithms (DBSCAN and OPTICS) have gotten significant attention of the scientific community due to their unique capability of discovering arbitrary shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms are extremely challenging as they exhibit inherent sequential data access order, unbalanced workload resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
NASA Astrophysics Data System (ADS)
Erez, Mattan; Dally, William J.
Stream processors, like other multi core architectures partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
Contemporary Spaces of Memory - Towards Transdisciplinarity in Architecture
NASA Astrophysics Data System (ADS)
Kabrońska, Joanna
2017-10-01
The paper explores new phenomena in the contemporary practice of commemoration implemented through architecture. Architectural objects related to memory can be a place where new trends and phenomena appear earlier than in other architectural objects. The text is an attempt to prove that these new spaces of memory are a kind of laboratory where new ideas taking place in architecture and related disciplines are being tested. Research focuses on the bond between the complex and difficult problem of memory and the issue of transdisciplinarity in architecture. Over the last few decades architecture has been - in comparison to other areas - a relatively closed domain of knowledge. Contemporary places of memory - different from the traditional - may be the evidence of changes. On the basis of theoretical approaches, interdisciplinary surveys, in-field analyses and case studies the paper give insight into the relationships between architecture and other areas, emerging in the recently created spaces of memory of different types. The text indicates that today both the study and the design of such places is difficult without going beyond the field of architecture. There is a need for further extensive research, but the paper confirms the potential of this research direction. Spaces of memory offer the opportunity to capture the transformation of the discipline at the moment when the process begins.
Implementation of collisions on GPU architecture in the Vorpal code
NASA Astrophysics Data System (ADS)
Leddy, Jarrod; Averkin, Sergey; Cowan, Ben; Sides, Scott; Werner, Greg; Cary, John
2017-10-01
The Vorpal code contains a variety of collision operators allowing for the simulation of plasmas containing multiple charge species interacting with neutrals, background gas, and EM fields. These existing algorithms have been improved and reimplemented to take advantage of the massive parallelization allowed by GPU architecture. The use of GPUs is most effective when algorithms are single-instruction multiple-data, so particle collisions are an ideal candidate for this parallelization technique due to their nature as a series of independent processes with the same underlying operation. This refactoring required data memory reorganization and careful consideration of device/host data allocation to minimize memory access and data communication per operation. Successful implementation has resulted in an order of magnitude increase in simulation speed for a test-case involving multiple binary collisions using the null collision method. Work supported by DARPA under contract W31P4Q-16-C-0009.
An architecture for real-time vision processing
NASA Technical Reports Server (NTRS)
Chien, Chiun-Hong
1994-01-01
To study the feasibility of developing an architecture for real time vision processing, a task queue server and parallel algorithms for two vision operations were designed and implemented on an i860-based Mercury Computing System 860VS array processor. The proposed architecture treats each vision function as a task or set of tasks which may be recursively divided into subtasks and processed by multiple processors coordinated by a task queue server accessible by all processors. Each idle processor subsequently fetches a task and associated data from the task queue server for processing and posts the result to shared memory for later use. Load balancing can be carried out within the processing system without the requirement for a centralized controller. The author concludes that real time vision processing cannot be achieved without both sequential and parallel vision algorithms and a good parallel vision architecture.
First experience of vectorizing electromagnetic physics models for detector simulation
NASA Astrophysics Data System (ADS)
Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Bianchini, C.; Bitzes, G.; Brun, R.; Canal, P.; Carminati, F.; de Fine Licht, J.; Duhem, L.; Elvira, D.; Gheata, A.; Jun, S. Y.; Lima, G.; Novak, M.; Presbyterian, M.; Shadura, O.; Seghal, R.; Wenzel, S.
2015-12-01
The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. The GeantV vector prototype for detector simulations has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth, parallelization needed to achieve optimal performance or memory access latency and speed. An additional challenge is to avoid the code duplication often inherent to supporting heterogeneous platforms. In this paper we present the first experience of vectorizing electromagnetic physics models developed for the GeantV project.
A Case Study on Neural Inspired Dynamic Memory Management Strategies for High Performance Computing.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vineyard, Craig Michael; Verzi, Stephen Joseph
As high performance computing architectures pursue more computational power there is a need for increased memory capacity and bandwidth as well. A multi-level memory (MLM) architecture addresses this need by combining multiple memory types with different characteristics as varying levels of the same architecture. How to efficiently utilize this memory infrastructure is an unknown challenge, and in this research we sought to investigate whether neural inspired approaches can meaningfully help with memory management. In particular we explored neurogenesis inspired re- source allocation, and were able to show a neural inspired mixed controller policy can beneficially impact how MLM architectures utilizemore » memory.« less
Multicore Architecture-aware Scientific Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Srinivasa, Avinash
Modern high performance systems are becoming increasingly complex and powerful due to advancements in processor and memory architecture. In order to keep up with this increasing complexity, applications have to be augmented with certain capabilities to fully exploit such systems. These may be at the application level, such as static or dynamic adaptations or at the system level, like having strategies in place to override some of the default operating system polices, the main objective being to improve computational performance of the application. The current work proposes two such capabilites with respect to multi-threaded scientific applications, in particular a largemore » scale physics application computing ab-initio nuclear structure. The first involves using a middleware tool to invoke dynamic adaptations in the application, so as to be able to adjust to the changing computational resource availability at run-time. The second involves a strategy for effective placement of data in main memory, to optimize memory access latencies and bandwidth. These capabilties when included were found to have a significant impact on the application performance, resulting in average speedups of as much as two to four times.« less
Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bernstein, P.A.
1988-02-01
The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user.more » However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.« less
Design and Analysis of Architectures for Structural Health Monitoring Systems
NASA Technical Reports Server (NTRS)
Mukkamala, Ravi; Sixto, S. L. (Technical Monitor)
2002-01-01
During the two-year project period, we have worked on several aspects of Health Usage and Monitoring Systems for structural health monitoring. In particular, we have made contributions in the following areas. 1. Reference HUMS architecture: We developed a high-level architecture for health monitoring and usage systems (HUMS). The proposed reference architecture is shown. It is compatible with the Generic Open Architecture (GOA) proposed as a standard for avionics systems. 2. HUMS kernel: One of the critical layers of HUMS reference architecture is the HUMS kernel. We developed a detailed design of a kernel to implement the high level architecture.3. Prototype implementation of HUMS kernel: We have implemented a preliminary version of the HUMS kernel on a Unix platform.We have implemented both a centralized system version and a distributed version. 4. SCRAMNet and HUMS: SCRAMNet (Shared Common Random Access Memory Network) is a system that is found to be suitable to implement HUMS. For this reason, we have conducted a simulation study to determine its stability in handling the input data rates in HUMS. 5. Architectural specification.
CoNNeCT Baseband Processor Module
NASA Technical Reports Server (NTRS)
Yamamoto, Clifford K; Jedrey, Thomas C.; Gutrich, Daniel G.; Goodpasture, Richard L.
2011-01-01
A document describes the CoNNeCT Baseband Processor Module (BPM) based on an updated processor, memory technology, and field-programmable gate arrays (FPGAs). The BPM was developed from a requirement to provide sufficient computing power and memory storage to conduct experiments for a Software Defined Radio (SDR) to be implemented. The flight SDR uses the AT697 SPARC processor with on-chip data and instruction cache. The non-volatile memory has been increased from a 20-Mbit EEPROM (electrically erasable programmable read only memory) to a 4-Gbit Flash, managed by the RTAX2000 Housekeeper, allowing more programs and FPGA bit-files to be stored. The volatile memory has been increased from a 20-Mbit SRAM (static random access memory) to a 1.25-Gbit SDRAM (synchronous dynamic random access memory), providing additional memory space for more complex operating systems and programs to be executed on the SPARC. All memory is EDAC (error detection and correction) protected, while the SPARC processor implements fault protection via TMR (triple modular redundancy) architecture. Further capability over prior BPM designs includes the addition of a second FPGA to implement features beyond the resources of a single FPGA. Both FPGAs are implemented with Xilinx Virtex-II and are interconnected by a 96-bit bus to facilitate data exchange. Dedicated 1.25- Gbit SDRAMs are wired to each Xilinx FPGA to accommodate high rate data buffering for SDR applications as well as independent SpaceWire interfaces. The RTAX2000 manages scrub and configuration of each Xilinx.
NASA Astrophysics Data System (ADS)
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Two-dimensional systolic-array architecture for pixel-level vision tasks
NASA Astrophysics Data System (ADS)
Vijverberg, Julien A.; de With, Peter H. N.
2010-05-01
This paper presents ongoing work on the design of a two-dimensional (2D) systolic array for image processing. This component is designed to operate on a multi-processor system-on-chip. In contrast with other 2D systolic-array architectures and many other hardware accelerators, we investigate the applicability of executing multiple tasks in a time-interleaved fashion on the Systolic Array (SA). This leads to a lower external memory bandwidth and better load balancing of the tasks on the different processing tiles. To enable the interleaving of tasks, we add a shadow-state register for fast task switching. To reduce the number of accesses to the external memory, we propose to share the communication assist between consecutive tasks. A preliminary, non-functional version of the SA has been synthesized for an XV4S25 FPGA device and yields a maximum clock frequency of 150 MHz requiring 1,447 slices and 5 memory blocks. Mapping tasks from video content-analysis applications from literature on the SA yields reductions in the execution time of 1-2 orders of magnitude compared to the software implementation. We conclude that the choice for an SA architecture is useful, but a scaled version of the SA featuring less logic with fewer processing and pipeline stages yielding a lower clock frequency, would be sufficient for a video analysis system-on-chip.
HTMT-class Latency Tolerant Parallel Architecture for Petaflops Scale Computation
NASA Technical Reports Server (NTRS)
Sterling, Thomas; Bergman, Larry
2000-01-01
Computational Aero Sciences and other numeric intensive computation disciplines demand computing throughputs substantially greater than the Teraflops scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that in combination with sufficient resolution and advanced adaptive techniques may force performance requirements towards Petaflops. This will be especially true for compute intensive models such as Navier-Stokes are or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithms techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops scale computing in the 2004/5 timeframe. The Hybrid-Technology, MultiThreaded parallel computer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) and one percent of the power required by convention semiconductor logic. Wave Division Multiplexing optical communications can approach a peak per fiber bandwidth of 1 Tbps and the new Data Vortex network topology employing this technology can connect tens of thousands of ports providing a bi-section bandwidth on the order of a Petabyte per second with latencies well below 100 nanoseconds, even under heavy loads. Processor-in-Memory (PIM) technology combines logic and memory on the same chip exposing the internal bandwidth of the memory row buffers at low latency. And holographic storage photorefractive storage technologies provide high-density memory with access a thousand times faster than conventional disk technologies. Together these technologies enable a new class of shared memory system architecture with a peak performance in the range of a Petaflops but size and power requirements comparable to today's largest Teraflops scale systems. To achieve high-sustained performance, HTMT combines an advanced multithreading processor architecture with a memory-driven coarse-grained latency management strategy called "percolation", yielding high efficiency while reducing the much of the parallel programming burden. This paper will present the basic system architecture characteristics made possible through this series of advanced technologies and then give a detailed description of the new percolation approach to runtime latency management.
NASA Astrophysics Data System (ADS)
Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.
2011-12-01
Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 andl6 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to it's size, places huge burdens on the memory infrastructure of today's processors.
Benchmarking high performance computing architectures with CMS’ skeleton framework
NASA Astrophysics Data System (ADS)
Sexton-Kennedy, E.; Gartung, P.; Jones, C. D.
2017-10-01
In 2012 CMS evaluated which underlying concurrency technology would be the best to use for its multi-threaded framework. The available technologies were evaluated on the high throughput computing systems dominating the resources in use at that time. A skeleton framework benchmarking suite that emulates the tasks performed within a CMSSW application was used to select Intel’s Thread Building Block library, based on the measured overheads in both memory and CPU on the different technologies benchmarked. In 2016 CMS will get access to high performance computing resources that use new many core architectures; machines such as Cori Phase 1&2, Theta, Mira. Because of this we have revived the 2012 benchmark to test it’s performance and conclusions on these new architectures. This talk will discuss the results of this exercise.
An expert system environment for the Generic VHSIC Spaceborne Computer (GVSC)
NASA Astrophysics Data System (ADS)
Cockerham, Ann; Labhart, Jay; Rowe, Michael; Skinner, James
The authors describe a Phase II Phillips Laboratory Small Business Innovative Research (SBIR) program being performed to implement a flexible and general-purpose inference environment for embedded space and avionics applications. This inference environment is being developed in Ada and takes special advantage of the target architecture, the GVSC. The GVSC implements the MIL-STD-1750A ISA and contains enhancements to allow access of up to 8 MBytes of memory. The inference environment makes use of the Merit Enhanced Traversal Engine (METE) algorithm, which employs the latest inference and knowledge representation strategies to optimize both run-time speed and memory utilization.
An Evaluation of the TRIPS Computer System (Extended Technical Report)
2008-07-08
Mario Marino Nitya Ranganathan Behnam Robatmili Aaron Smith James Burrill Stephen W. Keckler Doug Burger Kathryn S. McKinley Computer Architecture and...Marino, Nitya Ranganathan , Behnam Robatmili, Aaron Smith, James Burrill, Stephen W. Keckler, Doug Burger, Kathryn S. McKinley; ASPLOS 2009, Washington DC...aggressively register allo- cate more memory accesses by using programmer knowledge about pointer aliasing, much of which may be automated. They also
NASA Technical Reports Server (NTRS)
Dongarra, Jack
1998-01-01
This exploratory study initiated our inquiry into algorithms and applications that would benefit by latency tolerant approach to algorithm building, including the construction of new algorithms where appropriate. In a multithreaded execution, when a processor reaches a point where remote memory access is necessary, the request is sent out on the network and a context--switch occurs to a new thread of computation. This effectively masks a long and unpredictable latency due to remote loads, thereby providing tolerance to remote access latency. We began to develop standards to profile various algorithm and application parameters, such as the degree of parallelism, granularity, precision, instruction set mix, interprocessor communication, latency etc. These tools will continue to develop and evolve as the Information Power Grid environment matures. To provide a richer context for this research, the project also focused on issues of fault-tolerance and computation migration of numerical algorithms and software. During the initial phase we tried to increase our understanding of the bottlenecks in single processor performance. Our work began by developing an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. Based on the results we achieved in this study we are planning to study other architectures of interest, including development of cost models, and developing code generators appropriate to these architectures.
An open architecture for medical image workstation
NASA Astrophysics Data System (ADS)
Liang, Liang; Hu, Zhiqiang; Wang, Xiangyun
2005-04-01
Dealing with the difficulties of integrating various medical image viewing and processing technologies with a variety of clinical and departmental information systems and, in the meantime, overcoming the performance constraints in transferring and processing large-scale and ever-increasing image data in healthcare enterprise, we design and implement a flexible, usable and high-performance architecture for medical image workstations. This architecture is not developed for radiology only, but for any workstations in any application environments that may need medical image retrieving, viewing, and post-processing. This architecture contains an infrastructure named Memory PACS and different kinds of image applications built on it. The Memory PACS is in charge of image data caching, pre-fetching and management. It provides image applications with a high speed image data access and a very reliable DICOM network I/O. In dealing with the image applications, we use dynamic component technology to separate the performance-constrained modules from the flexibility-constrained modules so that different image viewing or processing technologies can be developed and maintained independently. We also develop a weakly coupled collaboration service, through which these image applications can communicate with each other or with third party applications. We applied this architecture in developing our product line and it works well. In our clinical sites, this architecture is applied not only in Radiology Department, but also in Ultrasonic, Surgery, Clinics, and Consultation Center. Giving that each concerned department has its particular requirements and business routines along with the facts that they all have different image processing technologies and image display devices, our workstations are still able to maintain high performance and high usability.
Cognitive Architectures for Multimedia Learning
ERIC Educational Resources Information Center
Reed, Stephen K.
2006-01-01
This article provides a tutorial overview of cognitive architectures that can form a theoretical foundation for designing multimedia instruction. Cognitive architectures include a description of memory stores, memory codes, and cognitive operations. Architectures that are relevant to multimedia learning include Paivio's dual coding theory,…
MPEG-1 low-cost encoder solution
NASA Astrophysics Data System (ADS)
Grueger, Klaus; Schirrmeister, Frank; Filor, Lutz; von Reventlow, Christian; Schneider, Ulrich; Mueller, Gerriet; Sefzik, Nicolai; Fiedrich, Sven
1995-02-01
A solution for real-time compression of digital YCRCB video data to an MPEG-1 video data stream has been developed. As an additional option, motion JPEG and video telephone streams (H.261) can be generated. For MPEG-1, up to two bidirectional predicted images are supported. The required computational power for motion estimation and DCT/IDCT, memory size and memory bandwidth have been the main challenges. The design uses fast-page-mode memory accesses and requires only one single 80 ns EDO-DRAM with 256 X 16 organization for video encoding. This can be achieved only by using adequate access and coding strategies. The architecture consists of an input processing and filter unit, a memory interface, a motion estimation unit, a motion compensation unit, a DCT unit, a quantization control, a VLC unit and a bus interface. For using the available memory bandwidth by the processing tasks, a fixed schedule for memory accesses has been applied, that can be interrupted for asynchronous events. The motion estimation unit implements a highly sophisticated hierarchical search strategy based on block matching. The DCT unit uses a separated fast-DCT flowgraph realized by a switchable hardware unit for both DCT and IDCT operation. By appropriate multiplexing, only one multiplier is required for: DCT, quantization, inverse quantization, and IDCT. The VLC unit generates the video-stream up to the video sequence layer and is directly coupled with an intelligent bus-interface. Thus, the assembly of video, audio and system data can easily be performed by the host computer. Having a relatively low complexity and only small requirements for DRAM circuits, the developed solution can be applied to low-cost encoding products for consumer electronics.
Cheng, Xue-Feng; Hou, Xiang; Qian, Wen-Hu; He, Jing-Hui; Xu, Qing-Feng; Li, Hua; Li, Na-Jun; Chen, Dong-Yun; Lu, Jian-Mei
2017-08-23
Herein, for the first time, quaternary resistive memory based on an organic molecule is achieved via surface engineering. A layer of poly(3,4-ethylenedioxythiophene)-poly(styrenesulfonate) (PEDOT-PSS) was inserted between the indium tin oxide (ITO) electrode and the organic layer (squaraine, SA-Bu) to form an ITO/PEDOT-PSS/SA-Bu/Al architecture. The modified resistive random-access memory (RRAM) devices achieve quaternary memory switching with the highest yield (∼41%) to date. Surface morphology, crystallinity, and mosaicity of the deposited organic grains are greatly improved after insertion of a PEDOT-PSS interlayer, which provides better contacts at the grain boundaries as well as the electrode/active layer interface. The PEDOT-PSS interlayer also reduces the hole injection barrier from the electrode to the active layer. Thus, the threshold voltage of each switching is greatly reduced, allowing for more quaternary switching in a certain voltage window. Our results provide a simple yet powerful strategy as an alternative to molecular design to achieve organic quaternary resistive memory.
Achieving High Performance With TCP Over 40 GbE on NUMA Architectures for CMS Data Acquisition
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bawej, Tomasz; et al.
2014-01-01
TCP and the socket abstraction have barely changed over the last two decades, but at the network layer there has been a giant leap from a few megabits to 100 gigabits in bandwidth. At the same time, CPU architectures have evolved into the multicore era and applications are expected to make full use of all available resources. Applications in the data acquisition domain based on the standard socket library running in a Non-Uniform Memory Access (NUMA) architecture are unable to reach full efficiency and scalability without the software being adequately aware about the IRQ (Interrupt Request), CPU and memory affinities.more » During the first long shutdown of LHC, the CMS DAQ system is going to be upgraded for operation from 2015 onwards and a new software component has been designed and developed in the CMS online framework for transferring data with sockets. This software attempts to wrap the low-level socket library to ease higher-level programming with an API based on an asynchronous event driven model similar to the DAT uDAPL API. It is an event-based application with NUMA optimizations, that allows for a high throughput of data across a large distributed system. This paper describes the architecture, the technologies involved and the performance measurements of the software in the context of the CMS distributed event building.« less
NASA Astrophysics Data System (ADS)
Megherbi, Dalila B.; Yan, Yin; Tanmay, Parikh; Khoury, Jed; Woods, C. L.
2004-11-01
Recently surveillance and Automatic Target Recognition (ATR) applications are increasing as the cost of computing power needed to process the massive amount of information continues to fall. This computing power has been made possible partly by the latest advances in FPGAs and SOPCs. In particular, to design and implement state-of-the-Art electro-optical imaging systems to provide advanced surveillance capabilities, there is a need to integrate several technologies (e.g. telescope, precise optics, cameras, image/compute vision algorithms, which can be geographically distributed or sharing distributed resources) into a programmable system and DSP systems. Additionally, pattern recognition techniques and fast information retrieval, are often important components of intelligent systems. The aim of this work is using embedded FPGA as a fast, configurable and synthesizable search engine in fast image pattern recognition/retrieval in a distributed hardware/software co-design environment. In particular, we propose and show a low cost Content Addressable Memory (CAM)-based distributed embedded FPGA hardware architecture solution with real time recognition capabilities and computing for pattern look-up, pattern recognition, and image retrieval. We show how the distributed CAM-based architecture offers a performance advantage of an order-of-magnitude over RAM-based architecture (Random Access Memory) search for implementing high speed pattern recognition for image retrieval. The methods of designing, implementing, and analyzing the proposed CAM based embedded architecture are described here. Other SOPC solutions/design issues are covered. Finally, experimental results, hardware verification, and performance evaluations using both the Xilinx Virtex-II and the Altera Apex20k are provided to show the potential and power of the proposed method for low cost reconfigurable fast image pattern recognition/retrieval at the hardware/software co-design level.
Benchmarking high performance computing architectures with CMS’ skeleton framework
Sexton-Kennedy, E.; Gartung, P.; Jones, C. D.
2017-11-23
Here, in 2012 CMS evaluated which underlying concurrency technology would be the best to use for its multi-threaded framework. The available technologies were evaluated on the high throughput computing systems dominating the resources in use at that time. A skeleton framework benchmarking suite that emulates the tasks performed within a CMSSW application was used to select Intel’s Thread Building Block library, based on the measured overheads in both memory and CPU on the different technologies benchmarked. In 2016 CMS will get access to high performance computing resources that use new many core architectures; machines such as Cori Phase 1&2, Theta,more » Mira. Because of this we have revived the 2012 benchmark to test it’s performance and conclusions on these new architectures. This talk will discuss the results of this exercise.« less
Benchmarking high performance computing architectures with CMS’ skeleton framework
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sexton-Kennedy, E.; Gartung, P.; Jones, C. D.
Here, in 2012 CMS evaluated which underlying concurrency technology would be the best to use for its multi-threaded framework. The available technologies were evaluated on the high throughput computing systems dominating the resources in use at that time. A skeleton framework benchmarking suite that emulates the tasks performed within a CMSSW application was used to select Intel’s Thread Building Block library, based on the measured overheads in both memory and CPU on the different technologies benchmarked. In 2016 CMS will get access to high performance computing resources that use new many core architectures; machines such as Cori Phase 1&2, Theta,more » Mira. Because of this we have revived the 2012 benchmark to test it’s performance and conclusions on these new architectures. This talk will discuss the results of this exercise.« less
Optimal Configuration and Deployment of Software on Multi-Core Processing Architectures
2008-07-01
between the event generating threads and the collector thread is implemented through semaphores . The Perseus data logger is designed to minimize the...performance counters (through the PAPI API) and opens up access to the shared memory logger through a semaphore and Remote Procedure Call (RPC) buffer... synchronization events. Using this rich data, the TMAM is able to output all of the information necessary to identify precisely which pairs of thread
On the robustness of bucket brigade quantum RAM
NASA Astrophysics Data System (ADS)
Arunachalam, Srinivasan; Gheorghiu, Vlad; Jochym-O'Connor, Tomas; Mosca, Michele; Varshinee Srinivasan, Priyaa
2015-12-01
We study the robustness of the bucket brigade quantum random access memory model introduced by Giovannetti et al (2008 Phys. Rev. Lett.100 160501). Due to a result of Regev and Schiff (ICALP ’08 733), we show that for a class of error models the error rate per gate in the bucket brigade quantum memory has to be of order o({2}-n/2) (where N={2}n is the size of the memory) whenever the memory is used as an oracle for the quantum searching problem. We conjecture that this is the case for any realistic error model that will be encountered in practice, and that for algorithms with super-polynomially many oracle queries the error rate must be super-polynomially small, which further motivates the need for quantum error correction. By contrast, for algorithms such as matrix inversion Harrow et al (2009 Phys. Rev. Lett.103 150502) or quantum machine learning Rebentrost et al (2014 Phys. Rev. Lett.113 130503) that only require a polynomial number of queries, the error rate only needs to be polynomially small and quantum error correction may not be required. We introduce a circuit model for the quantum bucket brigade architecture and argue that quantum error correction for the circuit causes the quantum bucket brigade architecture to lose its primary advantage of a small number of ‘active’ gates, since all components have to be actively error corrected.
Support for Diagnosis of Custom Computer Hardware
NASA Technical Reports Server (NTRS)
Molock, Dwaine S.
2008-01-01
The Coldfire SDN Diagnostics software is a flexible means of exercising, testing, and debugging custom computer hardware. The software is a set of routines that, collectively, serve as a common software interface through which one can gain access to various parts of the hardware under test and/or cause the hardware to perform various functions. The routines can be used to construct tests to exercise, and verify the operation of, various processors and hardware interfaces. More specifically, the software can be used to gain access to memory, to execute timer delays, to configure interrupts, and configure processor cache, floating-point, and direct-memory-access units. The software is designed to be used on diverse NASA projects, and can be customized for use with different processors and interfaces. The routines are supported, regardless of the architecture of a processor that one seeks to diagnose. The present version of the software is configured for Coldfire processors on the Subsystem Data Node processor boards of the Solar Dynamics Observatory. There is also support for the software with respect to Mongoose V, RAD750, and PPC405 processors or their equivalents.
Shared direct memory access on the Explorer 2-LX
NASA Technical Reports Server (NTRS)
Musgrave, Jeffrey L.
1990-01-01
Advances in Expert System technology and Artificial Intelligence have provided a framework for applying automated Intelligence to the solution of problems which were generally perceived as intractable using more classical approaches. As a result, hybrid architectures and parallel processing capability have become more common in computing environments. The Texas Instruments Explorer II-LX is an example of a machine which combines a symbolic processing environment, and a computationally oriented environment in a single chassis for integrated problem solutions. This user's manual is an attempt to make these capabilities more accessible to a wider range of engineers and programmers with problems well suited to solution in such an environment.
On the Usage of GPUs for Efficient Motion Estimation in Medical Image Sequences
Thiyagalingam, Jeyarajan; Goodman, Daniel; Schnabel, Julia A.; Trefethen, Anne; Grau, Vicente
2011-01-01
Images are ubiquitous in biomedical applications from basic research to clinical practice. With the rapid increase in resolution, dimensionality of the images and the need for real-time performance in many applications, computational requirements demand proper exploitation of multicore architectures. Towards this, GPU-specific implementations of image analysis algorithms are particularly promising. In this paper, we investigate the mapping of an enhanced motion estimation algorithm to novel GPU-specific architectures, the resulting challenges and benefits therein. Using a database of three-dimensional image sequences, we show that the mapping leads to substantial performance gains, up to a factor of 60, and can provide near-real-time experience. We also show how architectural peculiarities of these devices can be best exploited in the benefit of algorithms, most specifically for addressing the challenges related to their access patterns and different memory configurations. Finally, we evaluate the performance of the algorithm on three different GPU architectures and perform a comprehensive analysis of the results. PMID:21869880
2014-06-19
decision theory (Berger, 1985), and quantum probability theory (Busemeyer, Pothos, Franco, & Trueblood, 2011). Similarly, explanations in the 1 rch in...trengths of associations) are consciously inaccessible and con- titute the implicit knowledge of the model (Gonzalez & Lebiere, 005; Lebiere, Wallach...in memory (Lebiere et al., 2013). It is important to note hat this wholly implicit process is not consciously available to the odel. The second level
Integrated Short Range, Low Bandwidth, Wearable Communications Networking Technologies
2012-04-30
Only (FOUO) Table of Contents Introduction 7 Research Discussions 7 1 Specifications 8 2 SAN Radio 9 2.1 R.F. Design Improvements 9 2.1.1 LNA...Characterization and Verification Testing 26 2.2 Digital Design Improvements 26 2.2.1 Improve Processor Access to Memory Resources 26 2.2.2...integrated and tested . A hybrid architecture of the automatic gain control (AGC) was designed to Page 7 of 116 For Official Use Only (FOUO
Repeatable Reverse Engineering with the Platform for Architecture-Neutral Dynamic Analysis
2015-09-18
record and replay functionality: on a live execution, the amount of compute resources needed to identify and halt on every memory access and inspect...and iteratively, running a replay of the previously gathered recording over and over to construct a deeper understanding of the important aspects of...system events happen during the replay . A second analysis pass over the replay might focus in on the activity of a particular program or a portion of the
Genome accessibility is widely preserved and locally modulated during mitosis
Hsiung, Chris C.-S.; Morrissey, Christapher S.; Udugama, Maheshi; Frank, Christopher L.; Keller, Cheryl A.; Baek, Songjoon; Giardine, Belinda; Crawford, Gregory E.; Sung, Myong-Hee; Hardison, Ross C.
2015-01-01
Mitosis entails global alterations to chromosome structure and nuclear architecture, concomitant with transient silencing of transcription. How cells transmit transcriptional states through mitosis remains incompletely understood. While many nuclear factors dissociate from mitotic chromosomes, the observation that certain nuclear factors and chromatin features remain associated with individual loci during mitosis originated the hypothesis that such mitotically retained molecular signatures could provide transcriptional memory through mitosis. To understand the role of chromatin structure in mitotic memory, we performed the first genome-wide comparison of DNase I sensitivity of chromatin in mitosis and interphase, using a murine erythroblast model. Despite chromosome condensation during mitosis visible by microscopy, the landscape of chromatin accessibility at the macromolecular level is largely unaltered. However, mitotic chromatin accessibility is locally dynamic, with individual loci maintaining none, some, or all of their interphase accessibility. Mitotic reduction in accessibility occurs primarily within narrow, highly DNase hypersensitive sites that frequently coincide with transcription factor binding sites, whereas broader domains of moderate accessibility tend to be more stable. In mitosis, proximal promoters generally maintain their accessibility more strongly, whereas distal regulatory elements tend to lose accessibility. Large domains of DNA hypomethylation mark a subset of promoters that retain accessibility during mitosis and across many cell types in interphase. Erythroid transcription factor GATA1 exerts site-specific changes in interphase accessibility that are most pronounced at distal regulatory elements, but has little influence on mitotic accessibility. We conclude that features of open chromatin are remarkably stable through mitosis, but are modulated at the level of individual genes and regulatory elements. PMID:25373146
Nonvolatile Memory Materials for Neuromorphic Intelligent Machines.
Jeong, Doo Seok; Hwang, Cheol Seong
2018-04-18
Recent progress in deep learning extends the capability of artificial intelligence to various practical tasks, making the deep neural network (DNN) an extremely versatile hypothesis. While such DNN is virtually built on contemporary data centers of the von Neumann architecture, physical (in part) DNN of non-von Neumann architecture, also known as neuromorphic computing, can remarkably improve learning and inference efficiency. Particularly, resistance-based nonvolatile random access memory (NVRAM) highlights its handy and efficient application to the multiply-accumulate (MAC) operation in an analog manner. Here, an overview is given of the available types of resistance-based NVRAMs and their technological maturity from the material- and device-points of view. Examples within the strategy are subsequently addressed in comparison with their benchmarks (virtual DNN in deep learning). A spiking neural network (SNN) is another type of neural network that is more biologically plausible than the DNN. The successful incorporation of resistance-based NVRAM in SNN-based neuromorphic computing offers an efficient solution to the MAC operation and spike timing-based learning in nature. This strategy is exemplified from a material perspective. Intelligent machines are categorized according to their architecture and learning type. Also, the functionality and usefulness of NVRAM-based neuromorphic computing are addressed. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
NASA Technical Reports Server (NTRS)
Soltis, Steven R.; Ruwart, Thomas M.; OKeefe, Matthew T.
1996-01-01
The global file system (GFS) is a prototype design for a distributed file system in which cluster nodes physically share storage devices connected via a network-like fiber channel. Networks and network-attached storage devices have advanced to a level of performance and extensibility so that the previous disadvantages of shared disk architectures are no longer valid. This shared storage architecture attempts to exploit the sophistication of storage device technologies whereas a server architecture diminishes a device's role to that of a simple component. GFS distributes the file system responsibilities across processing nodes, storage across the devices, and file system resources across the entire storage pool. GFS caches data on the storage devices instead of the main memories of the machines. Consistency is established by using a locking mechanism maintained by the storage devices to facilitate atomic read-modify-write operations. The locking mechanism is being prototyped in the Silicon Graphics IRIX operating system and is accessed using standard Unix commands and modules.
NASA Astrophysics Data System (ADS)
Yang, Chen; Liu, LeiBo; Yin, ShouYi; Wei, ShaoJun
2014-12-01
The computational capability of a coarse-grained reconfigurable array (CGRA) can be significantly restrained due to data and context memory bandwidth bottlenecks. Traditionally, two methods have been used to resolve this problem. One method loads the context into the CGRA at run time. This method occupies very small on-chip memory but induces very large latency, which leads to low computational efficiency. The other method adopts a multi-context structure. This method loads the context into the on-chip context memory at the boot phase. Broadcasting the pointer of a set of contexts changes the hardware configuration on a cycle-by-cycle basis. The size of the context memory induces a large area overhead in multi-context structures, which results in major restrictions on application complexity. This paper proposes a Predictable Context Cache (PCC) architecture to address the above context issues by buffering the context inside a CGRA. In this architecture, context is dynamically transferred into the CGRA. Utilizing a PCC significantly reduces the on-chip context memory and the complexity of the applications running on the CGRA is no longer restricted by the size of the on-chip context memory. Data preloading is the most frequently used approach to hide input data latency and speed up the data transmission process for the data bandwidth issue. Rather than fundamentally reducing the amount of input data, the transferred data and computations are processed in parallel. However, the data preloading method cannot work efficiently because data transmission becomes the critical path as the reconfigurable array scale increases. This paper also presents a Hierarchical Data Memory (HDM) architecture as a solution to the efficiency problem. In this architecture, high internal bandwidth is provided to buffer both reused input data and intermediate data. The HDM architecture relieves the external memory from the data transfer burden so that the performance is significantly improved. As a result of using PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%-19.48% when there was a reasonable memory size. Therefore, 1080p@35.7fps for H.264 high profile video decoding can be achieved on PCC and HDM architecture when utilizing a 200 MHz working frequency. Further, the size of the on-chip context memory no longer restricted complex applications, which were efficiently executed on the PCC and HDM architecture.
1994-05-01
PARALLEL DISTRIBUTED MEMORY ARCHITECTURE LTJh T. M. Eidson 0 - 8 l 9 5 " G. Erlebacher _ _ _. _ DTIe QUALITY INSPECTED a Contract NAS I - 19480 May 1994...DISTRIBUTED MEMORY ARCHITECTURE T.M. Eidson * High Technology Corporation Hampton, VA 23665 G. Erlebachert Institute for Computer Applications in Science and...developed and evaluated. Simple model calculations as well as timing results are pres.nted to evaluate the various strategies. The particular
Calculating Reuse Distance from Source Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Narayanan, Sri Hari Krishna; Hovland, Paul
The efficient use of a system is of paramount importance in high-performance computing. Applications need to be engineered for future systems even before the architecture of such a system is clearly known. Static performance analysis that generates performance bounds is one way to approach the task of understanding application behavior. Performance bounds provide an upper limit on the performance of an application on a given architecture. Predicting cache hierarchy behavior and accesses to main memory is a requirement for accurate performance bounds. This work presents our static reuse distance algorithm to generate reuse distance histograms. We then use these histogramsmore » to predict cache miss rates. Experimental results for kernels studied show that the approach is accurate.« less
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS
Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.
2014-01-01
Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
Dynamically programmable cache
NASA Astrophysics Data System (ADS)
Nakkar, Mouna; Harding, John A.; Schwartz, David A.; Franzon, Paul D.; Conte, Thomas
1998-10-01
Reconfigurable machines have recently been used as co- processors to accelerate the execution of certain algorithms or program subroutines. The problems with the above approach include high reconfiguration time and limited partial reconfiguration. By far the most critical problems are: (1) the small on-chip memory which results in slower execution time, and (2) small FPGA areas that cannot implement large subroutines. Dynamically Programmable Cache (DPC) is a novel architecture for embedded processors which offers solutions to the above problems. To solve memory access problems, DPC processors merge reconfigurable arrays with the data cache at various cache levels to create a multi-level reconfigurable machines. As a result DPC machines have both higher data accessibility and FPGA memory bandwidth. To solve the limited FPGA resource problem, DPC processors implemented multi-context switching (Virtualization) concept. Virtualization allows implementation of large subroutines with fewer FPGA cells. Additionally, DPC processors can parallelize the execution of several operations resulting in faster execution time. In this paper, the speedup improvement for DPC machines are shown to be 5X faster than an Altera FLEX10K FPGA chip and 2X faster than a Sun Ultral SPARC station for two different algorithms (convolution and motion estimation).
Linguistic representations and memory architectures: The devil is in the details.
Chacón, Dustin Alfonso; Momma, Shota; Phillips, Colin
2016-01-01
Attempts to explain linguistic phenomena as consequences of memory constraints require detailed specification of linguistic representations and memory architectures alike. We discuss examples of supposed locality biases in language comprehension and production, and their link to memory constraints. Findings do not generally favor Christiansen & Chater's (C&C's) approach. We discuss connections to debates that stretch back to the nineteenth century.
CLOCS (Computer with Low Context-Switching Time) Architecture Reference Documents
1988-05-06
Peculiarities The only state inside the central processing unit(CPU) is a program status word. All data operations are memory to memory. One result of this... to the challenge "if I whore to design RISC, this is how I would do it." The architecture was designed by Mark Davis and Bill Gallmeister. 1.2...are memory to memory. Any special devices added should be memory mapped. The program counter is even memory mapped. 1.3.1 Working storage There is no
Intelligent holographic databases
NASA Astrophysics Data System (ADS)
Barbastathis, George
Memory is a key component of intelligence. In the human brain, physical structure and functionality jointly provide diverse memory modalities at multiple time scales. How could we engineer artificial memories with similar faculties? In this thesis, we attack both hardware and algorithmic aspects of this problem. A good part is devoted to holographic memory architectures, because they meet high capacity and parallelism requirements. We develop and fully characterize shift multiplexing, a novel storage method that simplifies disk head design for holographic disks. We develop and optimize the design of compact refreshable holographic random access memories, showing several ways that 1 Tbit can be stored holographically in volume less than 1 m3, with surface density more than 20 times higher than conventional silicon DRAM integrated circuits. To address the issue of photorefractive volatility, we further develop the two-lambda (dual wavelength) method for shift multiplexing, and combine electrical fixing with angle multiplexing to demonstrate 1,000 multiplexed fixed holograms. Finally, we propose a noise model and an information theoretic metric to optimize the imaging system of a holographic memory, in terms of storage density and error rate. Motivated by the problem of interfacing sensors and memories to a complex system with limited computational resources, we construct a computer game of Desert Survival, built as a high-dimensional non-stationary virtual environment in a competitive setting. The efficacy of episodic learning, implemented as a reinforced Nearest Neighbor scheme, and the probability of winning against a control opponent improve significantly by concentrating the algorithmic effort to the virtual desert neighborhood that emerges as most significant at any time. The generalized computational model combines the autonomous neural network and von Neumann paradigms through a compact, dynamic central representation, which contains the most salient features of the sensory inputs, fused with relevant recollections, reminiscent of the hypothesized cognitive function of awareness. The Declarative Memory is searched both by content and address, suggesting a holographic implementation. The proposed computer architecture may lead to a novel paradigm that solves 'hard' cognitive problems at low cost.
NASA Astrophysics Data System (ADS)
Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos
1990-03-01
The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16- x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.
NASA Technical Reports Server (NTRS)
Batcher, K. E.; Eddey, E. E.; Faiss, R. O.; Gilmore, P. A.
1981-01-01
The processing of synthetic aperture radar (SAR) signals using the massively parallel processor (MPP) is discussed. The fast Fourier transform convolution procedures employed in the algorithms are described. The MPP architecture comprises an array unit (ARU) which processes arrays of data; an array control unit which controls the operation of the ARU and performs scalar arithmetic; a program and data management unit which controls the flow of data; and a unique staging memory (SM) which buffers and permutes data. The ARU contains a 128 by 128 array of bit-serial processing elements (PE). Two-by-four surarrays of PE's are packaged in a custom VLSI HCMOS chip. The staging memory is a large multidimensional-access memory which buffers and permutes data flowing with the system. Efficient SAR processing is achieved via ARU communication paths and SM data manipulation. Real time processing capability can be realized via a multiple ARU, multiple SM configuration.
Latest generation interconnect technologies in APEnet+ networking infrastructure
NASA Astrophysics Data System (ADS)
Ammendola, Roberto; Biagioni, Andrea; Cretaro, Paolo; Frezza, Ottorino; Lo Cicero, Francesca; Lonardo, Alessandro; Martinelli, Michele; Stanislao Paolucci, Pier; Pastorelli, Elena; Rossetti, Davide; Simula, Francesco; Vicini, Piero
2017-10-01
In this paper we present the status of the 3rd generation design of the APEnet board (V5) built upon the 28nm Altera Stratix V FPGA; it features a PCIe Gen3 x8 interface and enhanced embedded transceivers with a maximum capability of 12.5Gbps each. The network architecture is designed in accordance to the Remote DMA paradigm. The APEnet+ V5 prototype is built upon the Stratix V DevKit with the addition of a proprietary, third party IP core implementing multi-DMA engines. Support for zero-copy communication is assured by the possibility of DMA-accessing either host and GPU memory, offloading the CPU from the chore of data copying. The current implementation plateaus to a bandwidth for memory read of 4.8GB/s. Here we describe the hardware optimization to the memory write process which relies on the use of two independent DMA engines and an improved TLB.
A compact superconducting nanowire memory element operated by nanowire cryotrons
NASA Astrophysics Data System (ADS)
Zhao, Qing-Yuan; Toomey, Emily A.; Butters, Brenden A.; McCaughan, Adam N.; Dane, Andrew E.; Nam, Sae-Woo; Berggren, Karl K.
2018-07-01
A superconducting loop stores persistent current without any ohmic loss, making it an ideal platform for energy efficient memories. Conventional superconducting memories use an architecture based on Josephson junctions (JJs) and have demonstrated access times less than 10 ps and power dissipation as low as 10-19 J. However, their scalability has been slow to develop due to the challenges in reducing the dimensions of JJs and minimizing the area of the superconducting loops. In addition to the memory itself, complex readout circuits require additional JJs and inductors for coupling signals, increasing the overall area. Here, we have demonstrated a superconducting memory based solely on lithographic nanowires. The small dimensions of the nanowire ensure that the device can be fabricated in a dense area in multiple layers, while the high kinetic inductance makes the loop essentially independent of geometric inductance, allowing it to be scaled down without sacrificing performance. The memory is operated by a group of nanowire cryotrons patterned alongside the storage loop, enabling us to reduce the entire memory cell to 3 μm × 7 μm in our proof-of-concept device. In this work we present the operation principles of a superconducting nanowire memory (nMem) and characterize its bit error rate, speed, and power dissipation.
The Earth Data Analytic Services (EDAS) Framework
NASA Astrophysics Data System (ADS)
Maxwell, T. P.; Duffy, D.
2017-12-01
Faced with unprecedented growth in earth data volume and demand, NASA has developed the Earth Data Analytic Services (EDAS) framework, a high performance big data analytics framework built on Apache Spark. This framework enables scientists to execute data processing workflows combining common analysis operations close to the massive data stores at NASA. The data is accessed in standard (NetCDF, HDF, etc.) formats in a POSIX file system and processed using vetted earth data analysis tools (ESMF, CDAT, NCO, etc.). EDAS utilizes a dynamic caching architecture, a custom distributed array framework, and a streaming parallel in-memory workflow for efficiently processing huge datasets within limited memory spaces with interactive response times. EDAS services are accessed via a WPS API being developed in collaboration with the ESGF Compute Working Team to support server-side analytics for ESGF. The API can be accessed using direct web service calls, a Python script, a Unix-like shell client, or a JavaScript-based web application. New analytic operations can be developed in Python, Java, or Scala (with support for other languages planned). Client packages in Python, Java/Scala, or JavaScript contain everything needed to build and submit EDAS requests. The EDAS architecture brings together the tools, data storage, and high-performance computing required for timely analysis of large-scale data sets, where the data resides, to ultimately produce societal benefits. It is is currently deployed at NASA in support of the Collaborative REAnalysis Technical Environment (CREATE) project, which centralizes numerous global reanalysis datasets onto a single advanced data analytics platform. This service enables decision makers to compare multiple reanalysis datasets and investigate trends, variability, and anomalies in earth system dynamics around the globe.
Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Bolosky, William Joseph
1993-01-01
Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. It finds that in properly built systems, software maintained coherence can perform comparably to or even better than hardware maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.
NASA Astrophysics Data System (ADS)
Pleros, Nikos; Maniotis, Pavlos; Alexoudi, Theonitsa; Fitsios, Dimitris; Vagionas, Christos; Papaioannou, Sotiris; Vyrsokinos, K.; Kanellos, George T.
2014-03-01
The processor-memory performance gap, commonly referred to as "Memory Wall" problem, owes to the speed mismatch between processor and electronic RAM clock frequencies, forcing current Chip Multiprocessor (CMP) configurations to consume more than 50% of the chip real-estate for caching purposes. In this article, we present our recent work spanning from Si-based integrated optical RAM cell architectures up to complete optical cache memory architectures for Chip Multiprocessor configurations. Moreover, we discuss on e/o router subsystems with up to Tb/s routing capacity for cache interconnection purposes within CMP configurations, currently pursued within the FP7 PhoxTrot project.
Genome accessibility is widely preserved and locally modulated during mitosis.
Hsiung, Chris C-S; Morrissey, Christapher S; Udugama, Maheshi; Frank, Christopher L; Keller, Cheryl A; Baek, Songjoon; Giardine, Belinda; Crawford, Gregory E; Sung, Myong-Hee; Hardison, Ross C; Blobel, Gerd A
2015-02-01
Mitosis entails global alterations to chromosome structure and nuclear architecture, concomitant with transient silencing of transcription. How cells transmit transcriptional states through mitosis remains incompletely understood. While many nuclear factors dissociate from mitotic chromosomes, the observation that certain nuclear factors and chromatin features remain associated with individual loci during mitosis originated the hypothesis that such mitotically retained molecular signatures could provide transcriptional memory through mitosis. To understand the role of chromatin structure in mitotic memory, we performed the first genome-wide comparison of DNase I sensitivity of chromatin in mitosis and interphase, using a murine erythroblast model. Despite chromosome condensation during mitosis visible by microscopy, the landscape of chromatin accessibility at the macromolecular level is largely unaltered. However, mitotic chromatin accessibility is locally dynamic, with individual loci maintaining none, some, or all of their interphase accessibility. Mitotic reduction in accessibility occurs primarily within narrow, highly DNase hypersensitive sites that frequently coincide with transcription factor binding sites, whereas broader domains of moderate accessibility tend to be more stable. In mitosis, proximal promoters generally maintain their accessibility more strongly, whereas distal regulatory elements tend to lose accessibility. Large domains of DNA hypomethylation mark a subset of promoters that retain accessibility during mitosis and across many cell types in interphase. Erythroid transcription factor GATA1 exerts site-specific changes in interphase accessibility that are most pronounced at distal regulatory elements, but has little influence on mitotic accessibility. We conclude that features of open chromatin are remarkably stable through mitosis, but are modulated at the level of individual genes and regulatory elements. © 2015 Hsiung et al.; Published by Cold Spring Harbor Laboratory Press.
Impact of Cognitive Architectures on Human-Computer Interaction
2014-09-01
activation, reinforced learning, emotion, semantic memory , episodic memory , and visual imagery.12 In 2010 Rosenbloom created a variant of the Soar...being added to almost every new version. In 2004 Nuxoll and Laird added episodic memory to the Soar architecture.11 In 2008 Laird presented...York (NY): Psychology Press; 2014; p. 1–50. 11. Nuxoll A, Laird JE. A cognitive model of episodic memory integrated with a general cognitive
Memory Reconsolidation and Computational Learning
2010-03-01
Cooper and H.T. Siegelmann, "Memory Reconsolidation for Natural Language Processing," Cognitive Neurodynamics , 3, 2009: 365-372. M.M. Olsen, N...computerized memories and other state of the art cognitive architectures, our memory system has the ability to process on-line and in real-time as...on both continuous and binary inputs, unlike state of the art methods in case based reasoning and in cognitive architectures, which are bound to
Bioinspired architecture approach for a one-billion transistor smart CMOS camera chip
NASA Astrophysics Data System (ADS)
Fey, Dietmar; Komann, Marcus
2007-05-01
In the paper we present a massively parallel VLSI architecture for future smart CMOS camera chips with up to one billion transistors. To exploit efficiently the potential offered by future micro- or nanoelectronic devices traditional on central structures oriented parallel architectures based on MIMD or SIMD approaches will fail. They require too long and too many global interconnects for the distribution of code or the access to common memory. On the other hand nature developed self-organising and emergent principles to manage successfully complex structures based on lots of interacting simple elements. Therefore we developed a new as Marching Pixels denoted emergent computing paradigm based on a mixture of bio-inspired computing models like cellular automaton and artificial ants. In the paper we present different Marching Pixels algorithms and the corresponding VLSI array architecture. A detailed synthesis result for a 0.18 μm CMOS process shows that a 256×256 pixel image is processed in less than 10 ms assuming a moderate 100 MHz clock rate for the processor array. Future higher integration densities and a 3D chip stacking technology will allow the integration and processing of Mega pixels within the same time since our architecture is fully scalable.
Optimization of atmospheric transport models on HPC platforms
NASA Astrophysics Data System (ADS)
de la Cruz, Raúl; Folch, Arnau; Farré, Pau; Cabezas, Javier; Navarro, Nacho; Cela, José María
2016-12-01
The performance and scalability of atmospheric transport models on high performance computing environments is often far from optimal for multiple reasons including, for example, sequential input and output, synchronous communications, work unbalance, memory access latency or lack of task overlapping. We investigate how different software optimizations and porting to non general-purpose hardware architectures improve code scalability and execution times considering, as an example, the FALL3D volcanic ash transport model. To this purpose, we implement the FALL3D model equations in the WARIS framework, a software designed from scratch to solve in a parallel and efficient way different geoscience problems on a wide variety of architectures. In addition, we consider further improvements in WARIS such as hybrid MPI-OMP parallelization, spatial blocking, auto-tuning and thread affinity. Considering all these aspects together, the FALL3D execution times for a realistic test case running on general-purpose cluster architectures (Intel Sandy Bridge) decrease by a factor between 7 and 40 depending on the grid resolution. Finally, we port the application to Intel Xeon Phi (MIC) and NVIDIA GPUs (CUDA) accelerator-based architectures and compare performance, cost and power consumption on all the architectures. Implications on time-constrained operational model configurations are discussed.
HYDRA : High-speed simulation architecture for precision spacecraft formation simulation
NASA Technical Reports Server (NTRS)
Martin, Bryan J.; Sohl, Garett.
2003-01-01
e Hierarchical Distributed Reconfigurable Architecture- is a scalable simulation architecture that provides flexibility and ease-of-use which take advantage of modern computation and communication hardware. It also provides the ability to implement distributed - or workstation - based simulations and high-fidelity real-time simulation from a common core. Originally designed to serve as a research platform for examining fundamental challenges in formation flying simulation for future space missions, it is also finding use in other missions and applications, all of which can take advantage of the underlying Object-Oriented structure to easily produce distributed simulations. Hydra automates the process of connecting disparate simulation components (Hydra Clients) through a client server architecture that uses high-level descriptions of data associated with each client to find and forge desirable connections (Hydra Services) at run time. Services communicate through the use of Connectors, which abstract messaging to provide single-interface access to any desired communication protocol, such as from shared-memory message passing to TCP/IP to ACE and COBRA. Hydra shares many features with the HLA, although providing more flexibility in connectivity services and behavior overriding.
Atomic memory access hardware implementations
Ahn, Jung Ho; Erez, Mattan; Dally, William J
2015-02-17
Atomic memory access requests are handled using a variety of systems and methods. According to one example method, a data-processing circuit having an address-request generator that issues requests to a common memory implements a method of processing the requests using a memory-access intervention circuit coupled between the generator and the common memory. The method identifies a current atomic-memory access request from a plurality of memory access requests. A data set is stored that corresponds to the current atomic-memory access request in a data storage circuit within the intervention circuit. It is determined whether the current atomic-memory access request corresponds to at least one previously-stored atomic-memory access request. In response to determining correspondence, the current request is implemented by retrieving data from the common memory. The data is modified in response to the current request and at least one other access request in the memory-access intervention circuit.
Data Movement Dominates: Advanced Memory Technology to Address the Real Exascale Power Problem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bergman, Keren
Energy is the fundamental barrier to Exascale supercomputing and is dominated by the cost of moving data from one point to another, not computation. Similarly, performance is dominated by data movement, not computation. The solution to this problem requires three critical technologies: 3D integration, optical chip-to-chip communication, and a new communication model. The central goal of the Sandia led "Data Movement Dominates" project aimed to develop memory systems and new architectures based on these technologies that have the potential to lower the cost of local memory accesses by orders of magnitude and provide substantially more bandwidth. Only through these transformationalmore » advances can future systems reach the goals of Exascale computing with a manageable power budgets. The Sandia led team included co-PIs from Columbia University, Lawrence Berkeley Lab, and the University of Maryland. The Columbia effort of Data Movement Dominates focused on developing a physically accurate simulation environment and experimental verification for optically-connected memory (OCM) systems that can enable continued performance scaling through high-bandwidth capacity, energy-efficient bit-rate transparency, and time-of-flight latency. With OCM, memory device parallelism and total capacity can scale to match future high-performance computing requirements without sacrificing data-movement efficiency. When we consider systems with integrated photonics, links to memory can be seamlessly integrated with the interconnection network-in a sense, memory becomes a primary aspect of the interconnection network. At the core of the Columbia effort, toward expanding our understanding of OCM enabled computing we have created an integrated modeling and simulation environment that uniquely integrates the physical behavior of the optical layer. The PhoenxSim suite of design and software tools developed under this effort has enabled the co-design of and performance evaluation photonics-enabled OCM architectures on Exascale computing systems.« less
Ariane 5-ALF: Evolution of the Ariane 5 Data Handling System
NASA Astrophysics Data System (ADS)
Notebaert, O.; Stransky, Arnaud; Corin, Hans; Hult, Torbjorn; Bonnerot, Georges-Albert
2004-06-01
In the coming years, the Ariane 5 On-Board-Computer (OBC) will handle missions and performances enhancements together with the need for significantly reducing costs and the replacement of obsolescent components. The OBC evolution is naturally driven by these factors, but also needs to consider the SW system compliance. Indeed, it would be a major concern that the necessary change of the underlying HW should imply new development of the flight software, mission database and ground control system.The Ariane 5 SW uses ADA language, which enables verifiable definition of the interfaces and provides a standardized level of the real-time behavior. To enforce portability, it has a layered architecture that clearly separates application SW and data from the lower level software. In addition, the on-board mission data is managed thanks to the extraction of an image of the systems database located in a structured memory area (the exchange memory). Used for all interchanges between the system application software and the launcher's subsystems and peripherals, the exchange memory is the virtual view of the Ariane 5 system from the flight SW standpoint. Thanks to these early architectural and structural choices, portability on future hardware is theoretically guaranteed, whenever the exchange memory data structures and the service layer interfaces remains stable. The ALF working group has defined and manufactured a mock-up that fulfils these architectural constraints with a completely new on-board computer featuring improvements such as the microprocessor replacement as well as an advanced integrated I/O controller for access to the system data bus. Lower level SW has been prototyped on this new hardware in order to fulfill the same level of services as the current one while completely hiding the underlying HW/SW implementation to the rest of the system. Functional and performance evaluation of this platform consolidated at system level will show the potential benefits and the limits of such approach.
Min, Xu; Zeng, Wanwen; Chen, Ning; Chen, Ting; Jiang, Rui
2017-07-15
Experimental techniques for measuring chromatin accessibility are expensive and time consuming, appealing for the development of computational approaches to predict open chromatin regions from DNA sequences. Along this direction, existing methods fall into two classes: one based on handcrafted k -mer features and the other based on convolutional neural networks. Although both categories have shown good performance in specific applications thus far, there still lacks a comprehensive framework to integrate useful k -mer co-occurrence information with recent advances in deep learning. We fill this gap by addressing the problem of chromatin accessibility prediction with a convolutional Long Short-Term Memory (LSTM) network with k -mer embedding. We first split DNA sequences into k -mers and pre-train k -mer embedding vectors based on the co-occurrence matrix of k -mers by using an unsupervised representation learning approach. We then construct a supervised deep learning architecture comprised of an embedding layer, three convolutional layers and a Bidirectional LSTM (BLSTM) layer for feature learning and classification. We demonstrate that our method gains high-quality fixed-length features from variable-length sequences and consistently outperforms baseline methods. We show that k -mer embedding can effectively enhance model performance by exploring different embedding strategies. We also prove the efficacy of both the convolution and the BLSTM layers by comparing two variations of the network architecture. We confirm the robustness of our model to hyper-parameters by performing sensitivity analysis. We hope our method can eventually reinforce our understanding of employing deep learning in genomic studies and shed light on research regarding mechanisms of chromatin accessibility. The source code can be downloaded from https://github.com/minxueric/ismb2017_lstm . tingchen@tsinghua.edu.cn or ruijiang@tsinghua.edu.cn. Supplementary materials are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Min, Xu; Zeng, Wanwen; Chen, Ning; Chen, Ting; Jiang, Rui
2017-01-01
Abstract Motivation: Experimental techniques for measuring chromatin accessibility are expensive and time consuming, appealing for the development of computational approaches to predict open chromatin regions from DNA sequences. Along this direction, existing methods fall into two classes: one based on handcrafted k-mer features and the other based on convolutional neural networks. Although both categories have shown good performance in specific applications thus far, there still lacks a comprehensive framework to integrate useful k-mer co-occurrence information with recent advances in deep learning. Results: We fill this gap by addressing the problem of chromatin accessibility prediction with a convolutional Long Short-Term Memory (LSTM) network with k-mer embedding. We first split DNA sequences into k-mers and pre-train k-mer embedding vectors based on the co-occurrence matrix of k-mers by using an unsupervised representation learning approach. We then construct a supervised deep learning architecture comprised of an embedding layer, three convolutional layers and a Bidirectional LSTM (BLSTM) layer for feature learning and classification. We demonstrate that our method gains high-quality fixed-length features from variable-length sequences and consistently outperforms baseline methods. We show that k-mer embedding can effectively enhance model performance by exploring different embedding strategies. We also prove the efficacy of both the convolution and the BLSTM layers by comparing two variations of the network architecture. We confirm the robustness of our model to hyper-parameters by performing sensitivity analysis. We hope our method can eventually reinforce our understanding of employing deep learning in genomic studies and shed light on research regarding mechanisms of chromatin accessibility. Availability and implementation: The source code can be downloaded from https://github.com/minxueric/ismb2017_lstm. Contact: tingchen@tsinghua.edu.cn or ruijiang@tsinghua.edu.cn Supplementary information: Supplementary materials are available at Bioinformatics online. PMID:28881969
Phipps, Eric T.; D'Elia, Marta; Edwards, Harold C.; ...
2017-04-18
In this study, quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in anmore » embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan).« less
Robust quantum network architectures and topologies for entanglement distribution
NASA Astrophysics Data System (ADS)
Das, Siddhartha; Khatri, Sumeet; Dowling, Jonathan P.
2018-01-01
Entanglement distribution is a prerequisite for several important quantum information processing and computing tasks, such as quantum teleportation, quantum key distribution, and distributed quantum computing. In this work, we focus on two-dimensional quantum networks based on optical quantum technologies using dual-rail photonic qubits for the building of a fail-safe quantum internet. We lay out a quantum network architecture for entanglement distribution between distant parties using a Bravais lattice topology, with the technological constraint that quantum repeaters equipped with quantum memories are not easily accessible. We provide a robust protocol for simultaneous entanglement distribution between two distant groups of parties on this network. We also discuss a memory-based quantum network architecture that can be implemented on networks with an arbitrary topology. We examine networks with bow-tie lattice and Archimedean lattice topologies and use percolation theory to quantify the robustness of the networks. In particular, we provide figures of merit on the loss parameter of the optical medium that depend only on the topology of the network and quantify the robustness of the network against intermittent photon loss and intermittent failure of nodes. These figures of merit can be used to compare the robustness of different network topologies in order to determine the best topology in a given real-world scenario, which is critical in the realization of the quantum internet.
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.
1992-01-01
An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.
Parallel computing for probabilistic fatigue analysis
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.
1993-01-01
This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.
DDGIPS: a general image processing system in robot vision
NASA Astrophysics Data System (ADS)
Tian, Yuan; Ying, Jun; Ye, Xiuqing; Gu, Weikang
2000-10-01
Real-Time Image Processing is the key work in robot vision. With the limitation of the hardware technique, many algorithm-oriented firmware systems were designed in the past. But their architectures were not flexible enough to achieve a multi-algorithm development system. Because of the rapid development of microelectronics technique, many high performance DSP chips and high density FPGA chips have come to life, and this makes it possible to construct a more flexible architecture in real-time image processing system. In this paper, a Double DSP General Image Processing System (DDGIPS) is concerned. We try to construct a two-DSP-based FPGA-computational system with two TMS320C6201s. The TMS320C6x devices are fixed-point processors based on the advanced VLIW CPU, which has eight functional units, including two multipliers and six arithmetic logic units. These features make C6x a good candidate for a general purpose system. In our system, the two TMS320C6201s each has a local memory space, and they also have a shared system memory space which enables them to intercommunicate and exchange data efficiently. At the same time, they can be directly inter-connected in star-shaped architecture. All of these are under the control of a FPGA group. As the core of the system, FPGA plays a very important role: it takes charge of DPS control, DSP communication, memory space access arbitration and the communication between the system and the host machine. And taking advantage of reconfiguring FPGA, all of the interconnection between the two DSP or between DSP and FPGA can be changed. In this way, users can easily rebuild the real-time image processing system according to the data stream and the task of the application and gain great flexibility.
DDGIPS: a general image processing system in robot vision
NASA Astrophysics Data System (ADS)
Tian, Yuan; Ying, Jun; Ye, Xiuqing; Gu, Weikang
2000-10-01
Real-Time Image Processing is the key work in robot vision. With the limitation of the hardware technique, many algorithm-oriented firmware systems were designed in the past. But their architectures were not flexible enough to achieve a multi- algorithm development system. Because of the rapid development of microelectronics technique, many high performance DSP chips and high density FPGA chips have come to life, and this makes it possible to construct a more flexible architecture in real-time image processing system. In this paper, a Double DSP General Image Processing System (DDGIPS) is concerned. We try to construct a two-DSP-based FPGA-computational system with two TMS320C6201s. The TMS320C6x devices are fixed-point processors based on the advanced VLIW CPU, which has eight functional units, including two multipliers and six arithmetic logic units. These features make C6x a good candidate for a general purpose system. In our system, the two TMS320C6210s each has a local memory space, and they also have a shared system memory space which enable them to intercommunicate and exchange data efficiently. At the same time, they can be directly interconnected in star- shaped architecture. All of these are under the control of FPGA group. As the core of the system, FPGA plays a very important role: it takes charge of DPS control, DSP communication, memory space access arbitration and the communication between the system and the host machine. And taking advantage of reconfiguring FPGA, all of the interconnection between the two DSP or between DSP and FPGA can be changed. In this way, users can easily rebuild the real-time image processing system according to the data stream and the task of the application and gain great flexibility.
The Efficiency and the Scalability of an Explicit Operator on an IBM POWER4 System
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Biegel, Bryan A. (Technical Monitor)
2002-01-01
We present an evaluation of the efficiency and the scalability of an explicit CFD operator on an IBM POWER4 system. The POWER4 architecture exhibits a common trend in HPC architectures: boosting CPU processing power by increasing the number of functional units, while hiding the latency of memory access by increasing the depth of the memory hierarchy. The overall machine performance depends on the ability of the caches-buses-fabric-memory to feed the functional units with the data to be processed. In this study we evaluate the efficiency and scalability of one explicit CFD operator on an IBM POWER4. This operator performs computations at the points of a Cartesian grid and involves a few dozen floating point numbers and on the order of 100 floating point operations per grid point. The computations in all grid points are independent. Specifically, we estimate the efficiency of the RHS operator (SP of NPB) on a single processor as the observed/peak performance ratio. Then we estimate the scalability of the operator on a single chip (2 CPUs), a single MCM (8 CPUs), 16 CPUs, and the whole machine (32 CPUs). Then we perform the same measurements for a chache-optimized version of the RHS operator. For our measurements we use the HPM (Hardware Performance Monitor) counters available on the POWER4. These counters allow us to analyze the obtained performance results.
Phase space simulation of collisionless stellar systems on the massively parallel processor
NASA Technical Reports Server (NTRS)
White, Richard L.
1987-01-01
A numerical technique for solving the collisionless Boltzmann equation describing the time evolution of a self gravitating fluid in phase space was implemented on the Massively Parallel Processor (MPP). The code performs calculations for a two dimensional phase space grid (with one space and one velocity dimension). Some results from calculations are presented. The execution speed of the code is comparable to the speed of a single processor of a Cray-XMP. Advantages and disadvantages of the MPP architecture for this type of problem are discussed. The nearest neighbor connectivity of the MPP array does not pose a significant obstacle. Future MPP-like machines should have much more local memory and easier access to staging memory and disks in order to be effective for this type of problem.
Design of Belief Propagation Based on FPGA for the Multistereo CAFADIS Camera
Magdaleno, Eduardo; Lüke, Jonás Philipp; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel
2010-01-01
In this paper we describe a fast, specialized hardware implementation of the belief propagation algorithm for the CAFADIS camera, a new plenoptic sensor patented by the University of La Laguna. This camera captures the lightfield of the scene and can be used to find out at which depth each pixel is in focus. The algorithm has been designed for FPGA devices using VHDL. We propose a parallel and pipeline architecture to implement the algorithm without external memory. Although the BRAM resources of the device increase considerably, we can maintain real-time restrictions by using extremely high-performance signal processing capability through parallelism and by accessing several memories simultaneously. The quantifying results with 16 bit precision have shown that performances are really close to the original Matlab programmed algorithm. PMID:22163404
Design of belief propagation based on FPGA for the multistereo CAFADIS camera.
Magdaleno, Eduardo; Lüke, Jonás Philipp; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel
2010-01-01
In this paper we describe a fast, specialized hardware implementation of the belief propagation algorithm for the CAFADIS camera, a new plenoptic sensor patented by the University of La Laguna. This camera captures the lightfield of the scene and can be used to find out at which depth each pixel is in focus. The algorithm has been designed for FPGA devices using VHDL. We propose a parallel and pipeline architecture to implement the algorithm without external memory. Although the BRAM resources of the device increase considerably, we can maintain real-time restrictions by using extremely high-performance signal processing capability through parallelism and by accessing several memories simultaneously. The quantifying results with 16 bit precision have shown that performances are really close to the original Matlab programmed algorithm.
Low power test architecture for dynamic read destructive fault detection in SRAM
NASA Astrophysics Data System (ADS)
Takher, Vikram Singh; Choudhary, Rahul Raj
2018-06-01
Dynamic Read Destructive Fault (dRDF) is the outcome of resistive open defects in the core cells of static random-access memories (SRAMs). The sensitisation of dRDF involves either performing multiple read operations or creation of number of read equivalent stress (RES), on the core cell under test. Though the creation of RES is preferred over the performing multiple read operation on the core cell, cell dissipates more power during RES than during the read or write operation. This paper focuses on the reduction in power dissipation by optimisation of number of RESs, which are required to sensitise the dRDF during test mode of operation of SRAM. The novel pre-charge architecture has been proposed in order to reduce the power dissipation by limiting the number of RESs to an optimised number of two. The proposed low power architecture is simulated and analysed which shows reduction in power dissipation by reducing the number of RESs up to 18.18%.
Integrating Cache Performance Modeling and Tuning Support in Parallelization Tools
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
With the resurgence of distributed shared memory (DSM) systems based on cache-coherent Non Uniform Memory Access (ccNUMA) architectures and increasing disparity between memory and processors speeds, data locality overheads are becoming the greatest bottlenecks in the way of realizing potential high performance of these systems. While parallelization tools and compilers facilitate the users in porting their sequential applications to a DSM system, a lot of time and effort is needed to tune the memory performance of these applications to achieve reasonable speedup. In this paper, we show that integrating cache performance modeling and tuning support within a parallelization environment can alleviate this problem. The Cache Performance Modeling and Prediction Tool (CPMP), employs trace-driven simulation techniques without the overhead of generating and managing detailed address traces. CPMP predicts the cache performance impact of source code level "what-if" modifications in a program to assist a user in the tuning process. CPMP is built on top of a customized version of the Computer Aided Parallelization Tools (CAPTools) environment. Finally, we demonstrate how CPMP can be applied to tune a real Computational Fluid Dynamics (CFD) application.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bales, Benjamin B; Barrett, Richard F
In almost all modern scientific applications, developers achieve the greatest performance gains by tuning algorithms, communication systems, and memory access patterns, while leaving low level instruction optimizations to the compiler. Given the increasingly varied and complicated x86 architectures, the value of these optimizations is unclear, and, due to time and complexity constraints, it is difficult for many programmers to experiment with them. In this report we explore the potential gains of these 'last mile' optimization efforts on an AMD Barcelona processor, providing readers with relevant information so that they can decide whether investment in the presented optimizations is worthwhile.
jmzML, an open-source Java API for mzML, the PSI standard for MS data.
Côté, Richard G; Reisinger, Florian; Martens, Lennart
2010-04-01
We here present jmzML, a Java API for the Proteomics Standards Initiative mzML data standard. Based on the Java Architecture for XML Binding and XPath-based XML indexer random-access XML parser, jmzML can handle arbitrarily large files in minimal memory, allowing easy and efficient processing of mzML files using the Java programming language. jmzML also automatically resolves internal XML references on-the-fly. The library (which includes a viewer) can be downloaded from http://jmzml.googlecode.com.
Wolinski, Christophe Czeslaw [Los Alamos, NM; Gokhale, Maya B [Los Alamos, NM; McCabe, Kevin Peter [Los Alamos, NM
2011-01-18
Fabric-based computing systems and methods are disclosed. A fabric-based computing system can include a polymorphous computing fabric that can be customized on a per application basis and a host processor in communication with said polymorphous computing fabric. The polymorphous computing fabric includes a cellular architecture that can be highly parameterized to enable a customized synthesis of fabric instances for a variety of enhanced application performances thereof. A global memory concept can also be included that provides the host processor random access to all variables and instructions associated with the polymorphous computing fabric.
Computing NLTE Opacities -- Node Level Parallel Calculation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holladay, Daniel
Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
2014-08-01
searchrequired for SPH are described in Sect. 3. Section 4 contains aperformance analysis of the algorithm using Kepler -type GPUcards. 2. Numerical...generation of Kepler architecture, code nameGK104, which is also implemented in Tesla K10. The Keplerarchitecture relies on a Graphics Processing Cluster (GPC...lat-ter is 512 KB large and has a bandwidth of 512 B/clockcycle. Constant memory (read only per grid): 48 KB per Kepler SM.Used to hold constants
NASA Technical Reports Server (NTRS)
Shalkhauser, Mary JO; Quintana, Jorge A.; Soni, Nitin J.
1994-01-01
The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.
NASA Astrophysics Data System (ADS)
Hassan, A. H.; Fluke, C. J.; Barnes, D. G.
2012-09-01
Upcoming and future astronomy research facilities will systematically generate terabyte-sized data sets moving astronomy into the Petascale data era. While such facilities will provide astronomers with unprecedented levels of accuracy and coverage, the increases in dataset size and dimensionality will pose serious computational challenges for many current astronomy data analysis and visualization tools. With such data sizes, even simple data analysis tasks (e.g. calculating a histogram or computing data minimum/maximum) may not be achievable without access to a supercomputing facility. To effectively handle such dataset sizes, which exceed today's single machine memory and processing limits, we present a framework that exploits the distributed power of GPUs and many-core CPUs, with a goal of providing data analysis and visualizing tasks as a service for astronomers. By mixing shared and distributed memory architectures, our framework effectively utilizes the underlying hardware infrastructure handling both batched and real-time data analysis and visualization tasks. Offering such functionality as a service in a “software as a service” manner will reduce the total cost of ownership, provide an easy to use tool to the wider astronomical community, and enable a more optimized utilization of the underlying hardware infrastructure.
NASA Astrophysics Data System (ADS)
Zhang, Lei; Zhu, Liang; Li, Xiaomei; Xu, Zhi; Wang, Wenlong; Bai, Xuedong
2017-03-01
One diode-one resistor (1D1R) memory is an effective architecture to suppress the crosstalk interference, realizing the crossbar network integration of resistive random access memory (RRAM). Herein, we designed a p+-Si/n-ZnO heterostructure with 1D1R function. Compared with the conventional multilayer 1D1R devices, the structure and fabrication technique can be largely simplified. The real-time imaging of formation/rupture process of conductive filament (CF) process demonstrated the RS mechanism by in-situ transmission electron microscopy (TEM). Meanwhile, we observed that the formed CF is only confined to the outside of depletion region of Si/ZnO pn junction, and the formation of CF does not degrade the diode performance, which allows the coexistence of RS and rectifying behaviors, revealing the 1D1R switching model. Furthermore, it has been confirmed that the CF is consisting of the oxygen vacancy by in-situ TEM characterization.
NASA Technical Reports Server (NTRS)
Katti, R.; Wu, J.; Stadler, H.
1990-01-01
Vertical Bloch Line (VBL) memory is a recently conceived, integrated, solid-state, block-access, VLSI memory which offers the potential of 1Gbit/sq cm real storage density, gigabit per second data rates, and sub-millisecond average access times simultaneously at relatively low mass, volume, and power values when compared to alternative technologies. VBL's are micromagnetic structures within magnetic domain walls which can be manipulated using magnetic fields from integrated conductors. The presence or absence of VBL pairs are used to store binary information. At present, efforts are being directed at developing a single-chip memory using 25Mbit/sq cm technology in magnetic garnet material which integrates, at a single operating point, the writing, storage, reading, and amplification functions needed in a memory. This paper describes the current design architecture, functional elements, and supercomputer simulation results which are used to assist the design process. The current design architecture uses three metal layers, two ion implantation steps for modulating the thickness of the magnetic layer, one ion implantation step for assisting propagation in the major line track, one NiFe soft magnetic layer, one CoPt hard magnetic layer, and one reflective Cr layer for facilitating magneto-optic observation of magnetic structure. Data are stored in a series of elongated magnetic domains, called stripes, which serve as storage sites for arrays of VBL pairs. The ends of these stripes are placed near conductors which serve as VBL read/write gates. A major line track is present to provide a source and propagation path for magnetic bubbles. Writing and reading, respectively, are achieved by converting magnetic bubbles to VBL's and vice versa. The output function is effected by stretching a magnetic bubble and detecting it magnetoresistively. Experimental results from the past design cycle created four design goals for the current design cycle. First, the bias field ranges for the stripes and the major line needed to be matched. Second, the magnetic field barrier between the stripe and the read/write gates needed to be reduced. Third, current conductor routing needed to be improved to reduce occurrences of open-circuiting, short-circuiting, and eddy-current shielding. Fourth, a modified Co-alloy was needed with an increased coercivity and controlled magnetization to allow VBL stabilization to occur without affecting stripe stability.
A novel parallel architecture for local histogram equalization
NASA Astrophysics Data System (ADS)
Ohannessian, Mesrob I.; Choueiter, Ghinwa F.; Diab, Hassan
2005-07-01
Local histogram equalization is an image enhancement algorithm that has found wide application in the pre-processing stage of areas such as computer vision, pattern recognition and medical imaging. The computationally intensive nature of the procedure, however, is a main limitation when real time interactive applications are in question. This work explores the possibility of performing parallel local histogram equalization, using an array of special purpose elementary processors, through an HDL implementation that targets FPGA or ASIC platforms. A novel parallelization scheme is presented and the corresponding architecture is derived. The algorithm is reduced to pixel-level operations. Processing elements are assigned image blocks, to maintain a reasonable performance-cost ratio. To further simplify both processor and memory organizations, a bit-serial access scheme is used. A brief performance assessment is provided to illustrate and quantify the merit of the approach.
Ji, Yongsung; Zeigler, David F; Lee, Dong Su; Choi, Hyejung; Jen, Alex K-Y; Ko, Heung Cho; Kim, Tae-Wook
2013-01-01
Flexible organic memory devices are one of the integral components for future flexible organic electronics. However, high-density all-organic memory cell arrays on malleable substrates without cross-talk have not been demonstrated because of difficulties in their fabrication and relatively poor performances to date. Here we demonstrate the first flexible all-organic 64-bit memory cell array possessing one diode-one resistor architectures. Our all-organic one diode-one resistor cell exhibits excellent rewritable switching characteristics, even during and after harsh physical stresses. The write-read-erase-read output sequence of the cells perfectly correspond to the external pulse signal regardless of substrate deformation. The one diode-one resistor cell array is clearly addressed at the specified cells and encoded letters based on the standard ASCII character code. Our study on integrated organic memory cell arrays suggests that the all-organic one diode-one resistor cell architecture is suitable for high-density flexible organic memory applications in the future.
Phenomenal and access consciousness in olfaction.
Stevenson, Richard J
2009-12-01
Contemporary literature on consciousness, with some exceptions, rarely considers the olfactory system. In this article the characteristics of olfactory consciousness, viewed from the standpoint of the phenomenal (P)/access (A) distinction, are examined relative to the major senses. The review details several qualitative differences in both olfactory P consciousness (shifts in the felt location, universal synesthesia-like and affect-rich experiences, and misperceptions) and A consciousness (recovery from habituation, capacity for conscious processing, access to semantic and episodic memory, learning, attention, and in the serial-unitary nature of olfactory percepts). The basis for these differences is argued to arise from the functions that the olfactory system performs and from the unique neural architecture needed to instantiate them. These data suggest, at a minimum, that P and A consciousness are uniquely configured in olfaction and an argument can be made that the P and A distinction may not hold for this sensory system.
A processing architecture for associative short-term memory in electronic noses
NASA Astrophysics Data System (ADS)
Pioggia, G.; Ferro, M.; Di Francesco, F.; DeRossi, D.
2006-11-01
Electronic nose (e-nose) architectures usually consist of several modules that process various tasks such as control, data acquisition, data filtering, feature selection and pattern analysis. Heterogeneous techniques derived from chemometrics, neural networks, and fuzzy rules used to implement such tasks may lead to issues concerning module interconnection and cooperation. Moreover, a new learning phase is mandatory once new measurements have been added to the dataset, thus causing changes in the previously derived model. Consequently, if a loss in the previous learning occurs (catastrophic interference), real-time applications of e-noses are limited. To overcome these problems this paper presents an architecture for dynamic and efficient management of multi-transducer data processing techniques and for saving an associative short-term memory of the previously learned model. The architecture implements an artificial model of a hippocampus-based working memory, enabling the system to be ready for real-time applications. Starting from the base models available in the architecture core, dedicated models for neurons, maps and connections were tailored to an artificial olfactory system devoted to analysing olive oil. In order to verify the ability of the processing architecture in associative and short-term memory, a paired-associate learning test was applied. The avoidance of catastrophic interference was observed.
NASA Technical Reports Server (NTRS)
Schwab, Andrew J. (Inventor); Aylor, James (Inventor); Hitchcock, Charles Young (Inventor); Wulf, William A. (Inventor); McKee, Sally A. (Inventor); Moyer, Stephen A. (Inventor); Klenke, Robert (Inventor)
2000-01-01
A data processing system is disclosed which comprises a data processor and memory control device for controlling the access of information from the memory. The memory control device includes temporary storage and decision ability for determining what order to execute the memory accesses. The compiler detects the requirements of the data processor and selects the data to stream to the memory control device which determines a memory access order. The order in which to access said information is selected based on the location of information stored in the memory. The information is repeatedly accessed from memory and stored in the temporary storage until all streamed information is accessed. The information is stored until required by the data processor. The selection of the order in which to access information maximizes bandwidth and decreases the retrieval time.
A Core Knowledge Architecture of Visual Working Memory
ERIC Educational Resources Information Center
Wood, Justin N.
2011-01-01
Visual working memory (VWM) is widely thought to contain specialized buffers for retaining spatial and object information: a "spatial-object architecture." However, studies of adults, infants, and nonhuman animals show that visual cognition builds on core knowledge systems that retain more specialized representations: (1) spatiotemporal…
General-purpose interface bus for multiuser, multitasking computer system
NASA Technical Reports Server (NTRS)
Generazio, Edward R.; Roth, Don J.; Stang, David B.
1990-01-01
The architecture of a multiuser, multitasking, virtual-memory computer system intended for the use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast-Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.
Coward, L Andrew
2010-01-01
A model is described in which the hippocampal system functions as resource manager for the neocortex. This model is developed from an architectural concept for the brain as a whole within which the receptive fields of neocortical columns can gradually expand but with some limited exceptions tend not to contract. The definition process for receptive fields is constrained so that they overlap as little as possible, and change as little as possible, but at least a minimum number of columns detect their fields within every sensory input state. Below this minimum, the receptive fields of some columns are expanded slightly until the minimum level is reached. The columns in which this expansion occurs are selected by a competitive process in the hippocampal system that identifies those in which only a relatively small expansion is required, and sends signals to those columns that trigger the expansion. These expansions in receptive fields are the information record that forms the declarative memory of the input state. Episodic memory activates a set of columns in which receptive fields expanded simultaneously at some point in the past, and the hippocampal system is therefore the appropriate source for information guiding access to such memories. Semantic memory associates columns that are often active (with or without expansions in receptive fields) simultaneously. Initially, the hippocampus can guide access to such memories on the basis of initial information recording, but to avoid corruption of the information needed for ongoing resource management, access control shifts to other parts of the neocortex. The roles of the mammillary bodies, amygdala and anterior thalamic nucleus can be understood as modulating information recording in accordance with various behavioral priorities. During sleep, provisional physical connectivity is created that supports receptive field expansions in the subsequent wake period, but previously created memories are not affected. This model matches a wide range of neuropsychological observation better than alternative hippocampal models. The information mechanisms required by the model are consistent with known brain anatomy and neuron physiology.
An FPGA-Based Massively Parallel Neuromorphic Cortex Simulator
Wang, Runchun M.; Thakur, Chetan S.; van Schaik, André
2018-01-01
This paper presents a massively parallel and scalable neuromorphic cortex simulator designed for simulating large and structurally connected spiking neural networks, such as complex models of various areas of the cortex. The main novelty of this work is the abstraction of a neuromorphic architecture into clusters represented by minicolumns and hypercolumns, analogously to the fundamental structural units observed in neurobiology. Without this approach, simulating large-scale fully connected networks needs prohibitively large memory to store look-up tables for point-to-point connections. Instead, we use a novel architecture, based on the structural connectivity in the neocortex, such that all the required parameters and connections can be stored in on-chip memory. The cortex simulator can be easily reconfigured for simulating different neural networks without any change in hardware structure by programming the memory. A hierarchical communication scheme allows one neuron to have a fan-out of up to 200 k neurons. As a proof-of-concept, an implementation on one Altera Stratix V FPGA was able to simulate 20 million to 2.6 billion leaky-integrate-and-fire (LIF) neurons in real time. We verified the system by emulating a simplified auditory cortex (with 100 million neurons). This cortex simulator achieved a low power dissipation of 1.62 μW per neuron. With the advent of commercially available FPGA boards, our system offers an accessible and scalable tool for the design, real-time simulation, and analysis of large-scale spiking neural networks. PMID:29692702
An FPGA-Based Massively Parallel Neuromorphic Cortex Simulator.
Wang, Runchun M; Thakur, Chetan S; van Schaik, André
2018-01-01
This paper presents a massively parallel and scalable neuromorphic cortex simulator designed for simulating large and structurally connected spiking neural networks, such as complex models of various areas of the cortex. The main novelty of this work is the abstraction of a neuromorphic architecture into clusters represented by minicolumns and hypercolumns, analogously to the fundamental structural units observed in neurobiology. Without this approach, simulating large-scale fully connected networks needs prohibitively large memory to store look-up tables for point-to-point connections. Instead, we use a novel architecture, based on the structural connectivity in the neocortex, such that all the required parameters and connections can be stored in on-chip memory. The cortex simulator can be easily reconfigured for simulating different neural networks without any change in hardware structure by programming the memory. A hierarchical communication scheme allows one neuron to have a fan-out of up to 200 k neurons. As a proof-of-concept, an implementation on one Altera Stratix V FPGA was able to simulate 20 million to 2.6 billion leaky-integrate-and-fire (LIF) neurons in real time. We verified the system by emulating a simplified auditory cortex (with 100 million neurons). This cortex simulator achieved a low power dissipation of 1.62 μW per neuron. With the advent of commercially available FPGA boards, our system offers an accessible and scalable tool for the design, real-time simulation, and analysis of large-scale spiking neural networks.
Fabry-Perot confocal resonator optical associative memory
NASA Astrophysics Data System (ADS)
Burns, Thomas J.; Rogers, Steven K.; Vogel, George A.
1993-03-01
A unique optical associative memory architecture is presented that combines the optical processing environment of a Fabry-Perot confocal resonator with the dynamic storage and recall properties of volume holograms. The confocal resonator reduces the size and complexity of previous associative memory architectures by folding a large number of discrete optical components into an integrated, compact optical processing environment. Experimental results demonstrate the system is capable of recalling a complete object from memory when presented with partial information about the object. A Fourier optics model of the system's operation shows it implements a spatially continuous version of a discrete, binary Hopfield neural network associative memory.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta J.; Gimenez, J.; Caubet, J.
2003-01-01
Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.
A Multi-Level Parallelization Concept for High-Fidelity Multi-Block Solvers
NASA Technical Reports Server (NTRS)
Hatay, Ferhat F.; Jespersen, Dennis C.; Guruswamy, Guru P.; Rizk, Yehia M.; Byun, Chansup; Gee, Ken; VanDalsem, William R. (Technical Monitor)
1997-01-01
The integration of high-fidelity Computational Fluid Dynamics (CFD) analysis tools with the industrial design process benefits greatly from the robust implementations that are transportable across a wide range of computer architectures. In the present work, a hybrid domain-decomposition and parallelization concept was developed and implemented into the widely-used NASA multi-block Computational Fluid Dynamics (CFD) packages implemented in ENSAERO and OVERFLOW. The new parallel solver concept, PENS (Parallel Euler Navier-Stokes Solver), employs both fine and coarse granularity in data partitioning as well as data coalescing to obtain the desired load-balance characteristics on the available computer platforms. This multi-level parallelism implementation itself introduces no changes to the numerical results, hence the original fidelity of the packages are identically preserved. The present implementation uses the Message Passing Interface (MPI) library for interprocessor message passing and memory accessing. By choosing an appropriate combination of the available partitioning and coalescing capabilities only during the execution stage, the PENS solver becomes adaptable to different computer architectures from shared-memory to distributed-memory platforms with varying degrees of parallelism. The PENS implementation on the IBM SP2 distributed memory environment at the NASA Ames Research Center obtains 85 percent scalable parallel performance using fine-grain partitioning of single-block CFD domains using up to 128 wide computational nodes. Multi-block CFD simulations of complete aircraft simulations achieve 75 percent perfect load-balanced executions using data coalescing and the two levels of parallelism. SGI PowerChallenge, SGI Origin 2000, and a cluster of workstations are the other platforms where the robustness of the implementation is tested. The performance behavior on the other computer platforms with a variety of realistic problems will be included as this on-going study progresses.
Bayer image parallel decoding based on GPU
NASA Astrophysics Data System (ADS)
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.
Active holographic interconnects for interfacing volume storage
NASA Astrophysics Data System (ADS)
Domash, Lawrence H.; Schwartz, Jay R.; Nelson, Arthur R.; Levin, Philip S.
1992-04-01
In order to achieve the promise of terabit/cm3 data storage capacity for volume holographic optical memory, two technological challenges must be met. Satisfactory storage materials must be developed and the input/output architectures able to match their capacity with corresponding data access rates must also be designed. To date the materials problem has received more attention than devices and architectures for access and addressing. Two philosophies of parallel data access to 3-D storage have been discussed. The bit-oriented approach, represented by recent work on two-photon memories, attempts to store bits at local sites within a volume without affecting neighboring bits. High speed acousto-optic or electro- optic scanners together with dynamically focused lenses not presently available would be required. The second philosophy is that volume optical storage is essentially holographic in nature, and that each data write or read is to be distributed throughout the material volume on the basis of angle multiplexing or other schemes consistent with the principles of holography. The requirements for free space optical interconnects for digital computers and fiber optic network switching interfaces are also closely related to this class of devices. Interconnects, beamlet generators, angle multiplexers, scanners, fiber optic switches, and dynamic lenses are all devices which may be implemented by holographic or microdiffractive devices of various kinds, which we shall refer to collectively as holographic interconnect devices. At present, holographic interconnect devices are either fixed holograms or spatial light modulators. Optically or computer generated holograms (submicron resolution, 2-D or 3-D, encoding 1013 bits, nearly 100 diffraction efficiency) can implement sophisticated mathematical design principles, but of course once fabricated they cannot be changed. Spatial light modulators offer high speed programmability but have limited resolution (512 X 512 pixels, encoding about 106 bits of data) and limited diffraction efficiency. For any application, one must choose between high diffractive performance and programmability.
Importance of balanced architectures in the design of high-performance imaging systems
NASA Astrophysics Data System (ADS)
Sgro, Joseph A.; Stanton, Paul C.
1999-03-01
Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high- performances computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high performance general-purpose microprocessor technology have produced an abundance of low cost components suitable for use in high-performance computing systems. A common pitfall in the design of high performance imaging system, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The symptom characteristic of this problem is failure to the performance of the system to scale as more processors are added. The problem becomes exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high performance external memory interfaces makes it practical to design high performance imaging system with balanced computational and memory bandwidth. Real word examples of such designs will be presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.
A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh; Vetter, Jeffrey S; Li, Dong
Recent trends of CMOS scaling and increasing number of on-chip cores have led to a large increase in the size of on-chip caches. Since SRAM has low density and consumes large amount of leakage power, its use in designing on-chip caches has become more challenging. To address this issue, researchers are exploring the use of several emerging memory technologies, such as embedded DRAM, spin transfer torque RAM, resistive RAM, phase change RAM and domain wall memory. In this paper, we survey the architectural approaches proposed for designing memory systems and, specifically, caches with these emerging memory technologies. To highlight theirmore » similarities and differences, we present a classification of these technologies and architectural approaches based on their key characteristics. We also briefly summarize the challenges in using these technologies for architecting caches. We believe that this survey will help the readers gain insights into the emerging memory device technologies, and their potential use in designing future computing systems.« less
A Locality-Based Threading Algorithm for the Configuration-Interaction Method
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shan, Hongzhang; Williams, Samuel; Johnson, Calvin
The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality,-which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threadson the 64-core Intelmore » Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on theKnights Landing processor and 3× on the dual-socket Ivy Bridge node.« less
A Locality-Based Threading Algorithm for the Configuration-Interaction Method
Shan, Hongzhang; Williams, Samuel; Johnson, Calvin; ...
2017-07-03
The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality,-which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threadson the 64-core Intelmore » Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on theKnights Landing processor and 3× on the dual-socket Ivy Bridge node.« less
An ECG ambulatory system with mobile embedded architecture for ST-segment analysis.
Miranda-Cid, Alejandro; Alvarado-Serrano, Carlos
2010-01-01
A prototype of a ECG ambulatory system for long term monitoring of ST segment of 3 leads, low power, portability and data storage in solid state memory cards has been developed. The solution presented is based in a mobile embedded architecture of a portable entertainment device used as a tool for storage and processing of bioelectric signals, and a mid-range RISC microcontroller, PIC 16F877, which performs the digitalization and transmission of ECG. The ECG amplifier stage is a low power, unipolar voltage and presents minimal distortion of the phase response of high pass filter in the ST segment. We developed an algorithm that manages access to files through an implementation for FAT32, and the ECG display on the device screen. The records are stored in TXT format for further processing. After the acquisition, the system implemented works as a standard USB mass storage device.
Development of Network Interface Cards for TRIDAQ systems with the NaNet framework
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Cretaro, P.; Di Lorenzo, S.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Valente, P.; Vicini, P.
2017-03-01
NaNet is a framework for the development of FPGA-based PCI Express (PCIe) Network Interface Cards (NICs) with real-time data transport architecture that can be effectively employed in TRIDAQ systems. Key features of the architecture are the flexibility in the configuration of the number and kind of the I/O channels, the hardware offloading of the network protocol stack, the stream processing capability, and the zero-copy CPU and GPU Remote Direct Memory Access (RDMA). Three NIC designs have been developed with the NaNet framework: NaNet-1 and NaNet-10 for the CERN NA62 low level trigger and NaNet3 for the KM3NeT-IT underwater neutrino telescope DAQ system. We will focus our description on the NaNet-10 design, as it is the most complete of the three in terms of capabilities and integrated IPs of the framework.
Asymmetrical access to color and location in visual working memory.
Rajsic, Jason; Wilson, Daryl E
2014-10-01
Models of visual working memory (VWM) have benefitted greatly from the use of the delayed-matching paradigm. However, in this task, the ability to recall a probed feature is confounded with the ability to maintain the proper binding between the feature that is to be reported and the feature (typically location) that is used to cue a particular item for report. Given that location is typically used as a cue-feature, we used the delayed-estimation paradigm to compare memory for location to memory for color, rotating which feature was used as a cue and which was reported. Our results revealed several novel findings: 1) the likelihood of reporting a probed object's feature was superior when reporting location with a color cue than when reporting color with a location cue; 2) location report errors were composed entirely of swap errors, with little to no random location reports; and 3) both colour and location reports greatly benefitted from the presence of nonprobed items at test. This last finding suggests that it is uncertainty over the bindings between locations and colors at memory retrieval that drive swap errors, not at encoding. We interpret our findings as consistent with a representational architecture that nests remembered object features within remembered locations.
Binary mesh partitioning for cache-efficient visualization.
Tchiboukdjian, Marc; Danjean, Vincent; Raffin, Bruno
2010-01-01
One important bottleneck when visualizing large data sets is the data transfer between processor and memory. Cache-aware (CA) and cache-oblivious (CO) algorithms take into consideration the memory hierarchy to design cache efficient algorithms. CO approaches have the advantage to adapt to unknown and varying memory hierarchies. Recent CA and CO algorithms developed for 3D mesh layouts significantly improve performance of previous approaches, but they lack of theoretical performance guarantees. We present in this paper a {\\schmi O}(N\\log N) algorithm to compute a CO layout for unstructured but well shaped meshes. We prove that a coherent traversal of a N-size mesh in dimension d induces less than N/B+{\\schmi O}(N/M;{1/d}) cache-misses where B and M are the block size and the cache size, respectively. Experiments show that our layout computation is faster and significantly less memory consuming than the best known CO algorithm. Performance is comparable to this algorithm for classical visualization algorithm access patterns, or better when the BSP tree produced while computing the layout is used as an acceleration data structure adjusted to the layout. We also show that cache oblivious approaches lead to significant performance increases on recent GPU architectures.
Hypercluster Parallel Processor
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela
1992-01-01
Hypercluster computer system includes multiple digital processors, operation of which coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.
Lee, Ke-Jing; Chang, Yu-Chi; Lee, Cheng-Jung; Wang, Li-Wen; Wang, Yeong-Her
2017-12-09
A one-transistor and one-resistor (1T1R) architecture with a resistive random access memory (RRAM) cell connected to an organic thin-film transistor (OTFT) device is successfully demonstrated to avoid the cross-talk issues of only one RRAM cell. The OTFT device, which uses barium zirconate nickelate (BZN) as a dielectric layer, exhibits favorable electrical properties, such as a high field-effect mobility of 5 cm²/Vs, low threshold voltage of -1.1 V, and low leakage current of 10 -12 A, for a driver in the 1T1R operation scheme. The 1T1R architecture with a TiO₂-based RRAM cell connected with a BZN OTFT device indicates a low operation current (10 μA) and reliable data retention (over ten years). This favorable performance of the 1T1R device can be attributed to the additional barrier heights introduced by using Ni (II) acetylacetone as a substitute for acetylacetone, and the relatively low leakage current of a BZN dielectric layer. The proposed 1T1R device with low leakage current OTFT and excellent uniform resistance distribution of RRAM exhibits a good potential for use in practical low-power electronic applications.
2008-05-01
patterns. Our strategy to nucleate Ag nanoparticles has been to use a templating protein (e.g., streptavidin) that has been chemically pre- charged with...assembly is used to direct the formation of switching devices and wires to create logic circuitry, memory, and I/O interfaces . We can control the reaction...determines the formation of structures (through complementarity ). Sequence design is important because it determines many aspects of the target DNA
System, methods and apparatus for program optimization for multi-threaded processor architectures
Bastoul, Cedric; Lethin, Richard A; Leung, Allen K; Meister, Benoit J; Szilagyi, Peter; Vasilache, Nicolas T; Wohlford, David E
2015-01-06
Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for parallelism, locality of operations and contiguity of memory accesses on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
Generating Performance Models for Irregular Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Friese, Ryan D.; Tallent, Nathan R.; Vishnu, Abhinav
2017-05-30
Many applications have irregular behavior --- non-uniform input data, input-dependent solvers, irregular memory accesses, unbiased branches --- that cannot be captured using today's automated performance modeling techniques. We describe new hierarchical critical path analyses for the \\Palm model generation tool. To create a model's structure, we capture tasks along representative MPI critical paths. We create a histogram of critical tasks with parameterized task arguments and instance counts. To model each task, we identify hot instruction-level sub-paths and model each sub-path based on data flow, instruction scheduling, and data locality. We describe application models that generate accurate predictions for strong scalingmore » when varying CPU speed, cache speed, memory speed, and architecture. We present results for the Sweep3D neutron transport benchmark; Page Rank on multiple graphs; Support Vector Machine with pruning; and PFLOTRAN's reactive flow/transport solver with domain-induced load imbalance.« less
Key Technologies of Phone Storage Forensics Based on ARM Architecture
NASA Astrophysics Data System (ADS)
Zhang, Jianghan; Che, Shengbing
2018-03-01
Smart phones are mainly running Android, IOS and Windows Phone three mobile platform operating systems. The android smart phone has the best market shares and its processor chips are almost ARM software architecture. The chips memory address mapping mechanism of ARM software architecture is different with x86 software architecture. To forensics to android mart phone, we need to understand three key technologies: memory data acquisition, the conversion mechanism from virtual address to the physical address, and find the system’s key data. This article presents a viable solution which does not rely on the operating system API for a complete solution to these three issues.
A learnable parallel processing architecture towards unity of memory and computing
NASA Astrophysics Data System (ADS)
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-08-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
A learnable parallel processing architecture towards unity of memory and computing.
Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J
2015-08-14
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
A Compute Capable SSD Architecture for Next-Generation Non-volatile Memories
DOE Office of Scientific and Technical Information (OSTI.GOV)
De, Arup
2014-01-01
Existing storage technologies (e.g., disks and ash) are failing to cope with the processor and main memory speed and are limiting the overall perfor- mance of many large scale I/O or data-intensive applications. Emerging fast byte-addressable non-volatile memory (NVM) technologies, such as phase-change memory (PCM), spin-transfer torque memory (STTM) and memristor are very promising and are approaching DRAM-like performance with lower power con- sumption and higher density as process technology scales. These new memories are narrowing down the performance gap between the storage and the main mem- ory and are putting forward challenging problems on existing SSD architecture, I/O interfacemore » (e.g, SATA, PCIe) and software. This dissertation addresses those challenges and presents a novel SSD architecture called XSSD. XSSD o oads com- putation in storage to exploit fast NVMs and reduce the redundant data tra c across the I/O bus. XSSD o ers a exible RPC-based programming framework that developers can use for application development on SSD without dealing with the complication of the underlying architecture and communication management. We have built a prototype of XSSD on the BEE3 FPGA prototyping system. We implement various data-intensive applications and achieve speedup and energy ef- ciency of 1.5-8.9 and 1.7-10.27 respectively. This dissertation also compares XSSD with previous work on intelligent storage and intelligent memory. The existing ecosystem and these new enabling technologies make this system more viable than earlier ones.« less
NASA Astrophysics Data System (ADS)
Weigand, R.
Two new processor devices have been developed for the use on board of spacecrafts. An 8-bit 8032-microcontroller targets typical controlling applications in instruments and sub-systems, or could be used as a main processor on small satellites, whereas the LEON 32-bit SPARC processor can be used for high performance controlling and data processing tasks. The ADV80S32 is fully compliant to the Intel 80x1 architecture and instruction set, extended by additional peripherals, 512 bytes on-chip RAM and a bootstrap PROM, which allows downloading the application software using the CCSDS PacketWire pro- tocol. The memory controller provides a de-multiplexed address/data bus, and allows to access up to 16 MB data and 8 MB program RAM. The peripherals have been de- signed for the specific needs of a spacecraft, such as serial interfaces compatible to RS232, PacketWire and TTC-B-01, counters/timers for extended duration and a CRC calculation unit accelerating the CCSDS TM/TC protocol. The 0.5 um Atmel manu- facturing technology (MG2RT) provides latch-up and total dose immunity; SEU fault immunity is implemented by using SEU hardened Flip-Flops and EDAC protection of internal and external memories. The maximum clock frequency of 20 MHz allows a processing power of 3 MIPS. Engineering samples are available. For SW develop- ment, various SW packages for the 8051 architecture are on the market. The LEON processor implements a 32-bit SPARC V8 architecture, including all the multiply and divide instructions, complemented by a floating-point unit (FPU). It includes several standard peripherals, such as timers/watchdog, interrupt controller, UARTs, parallel I/Os and a memory controller, allowing to use 8, 16 and 32 bit PROM, SRAM or memory mapped I/O. With on-chip separate instruction and data caches, almost one instruction per clock cycle can be reached in some applications. A 33-MHz 32-bit PCI master/target interface and a PCI arbiter allow operating the device in a plug-in card (for SW development on PC etc.), or to consider using it as a PCI master controller in an on-board system. Advanced SEU fault tolerance is in- troduced by design, using triple modular redundancy (TMR) flip-flops for all registers and EDAC protection for all memories. The device will be manufactured in a radia- tion hard Atmel 0.25 um technology, targeting 100 MHz processor clock frequency. The non fault-tolerant LEON processor VHDL model is available as free source code, and the SPARC architecture is a well-known industry standard. Therefore, know-how, software tools and operating systems are widely available.
Novel memory architecture for video signal processor
NASA Astrophysics Data System (ADS)
Hung, Jen-Sheng; Lin, Chia-Hsing; Jen, Chein-Wei
1993-11-01
An on-chip memory architecture for video signal processor (VSP) is proposed. This memory structure is a two-level design for the different data locality in video applications. The upper level--Memory A provides enough storage capacity to reduce the impact on the limitation of chip I/O bandwidth, and the lower level--Memory B provides enough data parallelism and flexibility to meet the requirements of multiple reconfigurable pipeline function units in a single VSP chip. The needed memory size is decided by the memory usage analysis for video algorithms and the number of function units. Both levels of memory adopted a dual-port memory scheme to sustain the simultaneous read and write operations. Especially, Memory B uses multiple one-read-one-write memory banks to emulate the real multiport memory. Therefore, one can change the configuration of Memory B to several sets of memories with variable read/write ports by adjusting the bus switches. Then the numbers of read ports and write ports in proposed memory can meet requirement of data flow patterns in different video coding algorithms. We have finished the design of a prototype memory design using 1.2- micrometers SPDM SRAM technology and will fabricated it through TSMC, in Taiwan.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-11-29
... INTERNATIONAL TRADE COMMISSION [DN 2859] Certain Dynamic Random Access Memory Devices, and.... International Trade Commission has received a complaint entitled In Re Certain Dynamic Random Access Memory... certain dynamic random access memory devices, and products containing same. The complaint names Elpida...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-04-01
... Semiconductor Chips Having Synchronous Dynamic Random Access Memory Controllers and Products Containing Same... synchronous dynamic random access memory controllers and products containing same by reason of infringement of... semiconductor chips having synchronous dynamic random access memory controllers and products containing same...
Announcing Supercomputer Summit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wells, Jack; Bland, Buddy; Nichols, Jeff
Summit is the next leap in leadership-class computing systems for open science. With Summit we will be able to address, with greater complexity and higher fidelity, questions concerning who we are, our place on earth, and in our universe. Summit will deliver more than five times the computational performance of Titan’s 18,688 nodes, using only approximately 3,400 nodes when it arrives in 2017. Like Titan, Summit will have a hybrid architecture, and each node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs all connected together with NVIDIA’s high-speed NVLink. Each node will have over half a terabyte ofmore » coherent memory (high bandwidth memory + DDR4) addressable by all CPUs and GPUs plus 800GB of non-volatile RAM that can be used as a burst buffer or as extended memory. To provide a high rate of I/O throughput, the nodes will be connected in a non-blocking fat-tree using a dual-rail Mellanox EDR InfiniBand interconnect. Upon completion, Summit will allow researchers in all fields of science unprecedented access to solving some of the world’s most pressing challenges.« less
Implementation of real-time digital signal processing systems
NASA Technical Reports Server (NTRS)
Narasimha, M.; Peterson, A.; Narayan, S.
1978-01-01
Special purpose hardware implementation of DFT Computers and digital filters is considered in the light of newly introduced algorithms and IC devices. Recent work by Winograd on high-speed convolution techniques for computing short length DFT's, has motivated the development of more efficient algorithms, compared to the FFT, for evaluating the transform of longer sequences. Among these, prime factor algorithms appear suitable for special purpose hardware implementations. Architectural considerations in designing DFT computers based on these algorithms are discussed. With the availability of monolithic multiplier-accumulators, a direct implementation of IIR and FIR filters, using random access memories in place of shift registers, appears attractive. The memory addressing scheme involved in such implementations is discussed. A simple counter set-up to address the data memory in the realization of FIR filters is also described. The combination of a set of simple filters (weighting network) and a DFT computer is shown to realize a bank of uniform bandpass filters. The usefulness of this concept in arriving at a modular design for a million channel spectrum analyzer, based on microprocessors, is discussed.
Ohmacht, Martin
2017-08-15
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Ohmacht, Martin
2014-09-09
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-09-07
... Access Memory and Nand Flash Memory Devices and Products Containing Same; Notice of Institution of... importation, and the sale within the United States after importation of certain dynamic random access memory and NAND flash memory devices and products containing same by reason of infringement of certain claims...
Li, Guoqi; Deng, Lei; Wang, Dong; Wang, Wei; Zeng, Fei; Zhang, Ziyang; Li, Huanglong; Song, Sen; Pei, Jing; Shi, Luping
2016-01-01
Chunking refers to a phenomenon whereby individuals group items together when performing a memory task to improve the performance of sequential memory. In this work, we build a bio-plausible hierarchical chunking of sequential memory (HCSM) model to explain why such improvement happens. We address this issue by linking hierarchical chunking with synaptic plasticity and neuromorphic engineering. We uncover that a chunking mechanism reduces the requirements of synaptic plasticity since it allows applying synapses with narrow dynamic range and low precision to perform a memory task. We validate a hardware version of the model through simulation, based on measured memristor behavior with narrow dynamic range in neuromorphic circuits, which reveals how chunking works and what role it plays in encoding sequential memory. Our work deepens the understanding of sequential memory and enables incorporating it for the investigation of the brain-inspired computing on neuromorphic architecture. PMID:28066223
Improved Writing-Conductor Designs For Magnetic Memory
NASA Technical Reports Server (NTRS)
Wu, Jiin-Chuan; Stadler, Henry L.; Katti, Romney R.
1994-01-01
Writing currents reduced to practical levels. Improved conceptual designs for writing conductors in micromagnet/Hall-effect random-access integrated-circuit memory reduces electrical current needed to magnetize micromagnet in each memory cell. Basic concept of micromagnet/Hall-effect random-access memory presented in "Magnetic Analog Random-Access Memory" (NPO-17999).
Federal Register 2010, 2011, 2012, 2013, 2014
2010-03-25
... Access Memory Semiconductors and Products Containing Same, Including Memory Modules; Notice of... the sale within the United States after importation of certain dynamic random access memory semiconductors and products containing same, including memory modules, by reason of infringement of certain...
Federal Register 2010, 2011, 2012, 2013, 2014
2011-12-27
... INTERNATIONAL TRADE COMMISSION [Investigation No. 337-TA-821] Certain Dynamic Random Access Memory... importation, and the sale within the United States after importation of certain dynamic random access memory... certain dynamic random access memory devices, and products containing same that infringe one or more of...
Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Tumeo, Antonino; Secchi, Simone
Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, wemore » introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10\\% of accuracy. Emulation is only from 25 to 200 times slower than real time.« less
CMOS VLSI Active-Pixel Sensor for Tracking
NASA Technical Reports Server (NTRS)
Pain, Bedabrata; Sun, Chao; Yang, Guang; Heynssens, Julie
2004-01-01
An architecture for a proposed active-pixel sensor (APS) and a design to implement the architecture in a complementary metal oxide semiconductor (CMOS) very-large-scale integrated (VLSI) circuit provide for some advanced features that are expected to be especially desirable for tracking pointlike features of stars. The architecture would also make this APS suitable for robotic- vision and general pointing and tracking applications. CMOS imagers in general are well suited for pointing and tracking because they can be configured for random access to selected pixels and to provide readout from windows of interest within their fields of view. However, until now, the architectures of CMOS imagers have not supported multiwindow operation or low-noise data collection. Moreover, smearing and motion artifacts in collected images have made prior CMOS imagers unsuitable for tracking applications. The proposed CMOS imager (see figure) would include an array of 1,024 by 1,024 pixels containing high-performance photodiode-based APS circuitry. The pixel pitch would be 9 m. The operations of the pixel circuits would be sequenced and otherwise controlled by an on-chip timing and control block, which would enable the collection of image data, during a single frame period, from either the full frame (that is, all 1,024 1,024 pixels) or from within as many as 8 different arbitrarily placed windows as large as 8 by 8 pixels each. A typical prior CMOS APS operates in a row-at-a-time ( grolling-shutter h) readout mode, which gives rise to exposure skew. In contrast, the proposed APS would operate in a sample-first/readlater mode, suppressing rolling-shutter effects. In this mode, the analog readout signals from the pixels corresponding to the windows of the interest (which windows, in the star-tracking application, would presumably contain guide stars) would be sampled rapidly by routing them through a programmable diagonal switch array to an on-chip parallel analog memory array. The diagonal-switch and memory addresses would be generated by the on-chip controller. The memory array would be large enough to hold differential signals acquired from all 8 windows during a frame period. Following the rapid sampling from all the windows, the contents of the memory array would be read out sequentially by use of a capacitive transimpedance amplifier (CTIA) at a maximum data rate of 10 MHz. This data rate is compatible with an update rate of almost 10 Hz, even in full-frame operation
Method and apparatus for managing access to a memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
DeBenedictis, Erik
A method and apparatus for managing access to a memory of a computing system. A controller transforms a plurality of operations that represent a computing job into an operational memory layout that reduces a size of a selected portion of the memory that needs to be accessed to perform the computing job. The controller stores the operational memory layout in a plurality of memory cells within the selected portion of the memory. The controller controls a sequence by which a processor in the computing system accesses the memory to perform the computing job using the operational memory layout. The operationalmore » memory layout reduces an amount of energy consumed by the processor to perform the computing job.« less
Pu, Y-F; Jiang, N; Chang, W; Yang, H-X; Li, C; Duan, L-M
2017-05-08
To realize long-distance quantum communication and quantum network, it is required to have multiplexed quantum memory with many memory cells. Each memory cell needs to be individually addressable and independently accessible. Here we report an experiment that realizes a multiplexed DLCZ-type quantum memory with 225 individually accessible memory cells in a macroscopic atomic ensemble. As a key element for quantum repeaters, we demonstrate that entanglement with flying optical qubits can be stored into any neighboring memory cells and read out after a programmable time with high fidelity. Experimental realization of a multiplexed quantum memory with many individually accessible memory cells and programmable control of its addressing and readout makes an important step for its application in quantum information technology.
Recent Trends in Spintronics-Based Nanomagnetic Logic
NASA Astrophysics Data System (ADS)
Das, Jayita; Alam, Syed M.; Bhanja, Sanjukta
2014-09-01
With the growing concerns of standby power in sub-100-nm CMOS technologies, alternative computing techniques and memory technologies are explored. Spin transfer torque magnetoresistive RAM (STT-MRAM) is one such nonvolatile memory relying on magnetic tunnel junctions (MTJs) to store information. It uses spin transfer torque to write information and magnetoresistance to read information. In 2012, Everspin Technologies, Inc. commercialized the first 64Mbit Spin Torque MRAM. On the computing end, nanomagnetic logic (NML) is a promising technique with zero leakage and high data retention. In 2000, Cowburn and Welland first demonstrated its potential in logic and information propagation through magnetostatic interaction in a chain of single domain circular nanomagnetic dots of Supermalloy (Ni80Fe14Mo5X1, X is other metals). In 2006, Imre et al. demonstrated wires and majority gates followed by coplanar cross wire systems demonstration in 2010 by Pulecio et al. Since 2004 researchers have also investigated the potential of MTJs in logic. More recently with dipolar coupling between MTJs demonstrated in 2012, logic-in-memory architecture with STT-MRAM have been investigated. The architecture borrows the computing concept from NML and read and write style from MRAM. The architecture can switch its operation between logic and memory modes with clock as classifier. Further through logic partitioning between MTJ and CMOS plane, a significant performance boost has been observed in basic computing blocks within the architecture. In this work, we have explored the developments in NML, in MTJs and more recent developments in hybrid MTJ/CMOS logic-in-memory architecture and its unique logic partitioning capability.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.
A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC),Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” andmore » if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies near-memory and DRAM to be similar. Here, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient.« less
Portable and Error-Free DNA-Based Data Storage.
Yazdi, S M Hossein Tabatabaei; Gabrys, Ryan; Milenkovic, Olgica
2017-07-10
DNA-based data storage is an emerging nonvolatile memory technology of potentially unprecedented density, durability, and replication efficiency. The basic system implementation steps include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. Here we show for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. The novelty of our approach is to design an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. Our work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density. As such, it represents a crucial step towards practical employment of DNA molecules as storage media.
A Vertical Organic Transistor Architecture for Fast Nonvolatile Memory.
She, Xiao-Jian; Gustafsson, David; Sirringhaus, Henning
2017-02-01
A new device architecture for fast organic transistor memory is developed, based on a vertical organic transistor configuration incorporating high-performance ambipolar conjugated polymers and unipolar small molecules as the transport layers, to achieve reliable and fast programming and erasing of the threshold voltage shift in less than 200 ns. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Outline of a novel architecture for cortical computation.
Majumdar, Kaushik
2008-03-01
In this paper a novel architecture for cortical computation has been proposed. This architecture is composed of computing paths consisting of neurons and synapses. These paths have been decomposed into lateral, longitudinal and vertical components. Cortical computation has then been decomposed into lateral computation (LaC), longitudinal computation (LoC) and vertical computation (VeC). It has been shown that various loop structures in the cortical circuit play important roles in cortical computation as well as in memory storage and retrieval, keeping in conformity with the molecular basis of short and long term memory. A new learning scheme for the brain has also been proposed and how it is implemented within the proposed architecture has been explained. A few mathematical results about the architecture have been proposed, some of which are without proof.
Proceedings of the Mobile Satellite System Architectures and Multiple Access Techniques Workshop
NASA Technical Reports Server (NTRS)
Dessouky, Khaled
1989-01-01
The Mobile Satellite System Architectures and Multiple Access Techniques Workshop served as a forum for the debate of system and network architecture issues. Particular emphasis was on those issues relating to the choice of multiple access technique(s) for the Mobile Satellite Service (MSS). These proceedings contain articles that expand upon the 12 presentations given in the workshop. Contrasting views on Frequency Division Multiple Access (FDMA), Code Division Multiple Access (CDMA), and Time Division Multiple Access (TDMA)-based architectures are presented, and system issues relating to signaling, spacecraft design, and network management constraints are addressed. An overview article that summarizes the issues raised in the numerous discussion periods of the workshop is also included.
NASA Astrophysics Data System (ADS)
Kelley, Troy D.; McGhee, S.
2013-05-01
This paper describes the ongoing development of a robotic control architecture that inspired by computational cognitive architectures from the discipline of cognitive psychology. The Symbolic and Sub-Symbolic Robotics Intelligence Control System (SS-RICS) combines symbolic and sub-symbolic representations of knowledge into a unified control architecture. The new architecture leverages previous work in cognitive architectures, specifically the development of the Adaptive Character of Thought-Rational (ACT-R) and Soar. This paper details current work on learning from episodes or events. The use of episodic memory as a learning mechanism has, until recently, been largely ignored by computational cognitive architectures. This paper details work on metric level episodic memory streams and methods for translating episodes into abstract schemas. The presentation will include research on learning through novelty and self generated feedback mechanisms for autonomous systems.
Flexible cognitive resources: competitive content maps for attention and memory
Franconeri, Steven L.; Alvarez, George A.; Cavanagh, Patrick
2013-01-01
The brain has finite processing resources so that, as tasks become harder, performance degrades. Where do the limits on these resources come from? We focus on a variety of capacity-limited buffers related to attention, recognition, and memory that we claim have a two-dimensional ‘map’ architecture, where individual items compete for cortical real estate. This competitive format leads to capacity limits that are flexible, set by the nature of the content and their locations within an anatomically delimited space. We contrast this format with the standard ‘slot’ architecture and its fixed capacity. Using visual spatial attention and visual short-term memory as case studies, we suggest that competitive maps are a concrete and plausible architecture that limits cognitive capacity across many domains. PMID:23428935
The architecture of tomorrow's massively parallel computer
NASA Technical Reports Server (NTRS)
Batcher, Ken
1987-01-01
Goodyear Aerospace delivered the Massively Parallel Processor (MPP) to NASA/Goddard in May 1983, over three years ago. Ever since then, Goodyear has tried to look in a forward direction. There is always some debate as to which way is forward when it comes to supercomputer architecture. Improvements to the MPP's massively parallel architecture are discussed in the areas of data I/O, memory capacity, connectivity, and indirect (or local) addressing. In I/O, transfer rates up to 640 megabytes per second can be achieved. There are devices that can supply the data and accept it at this rate. The memory capacity can be increased up to 128 megabytes in the ARU and over a gigabyte in the staging memory. For connectivity, there are several different kinds of multistage networks that should be considered.
Accelerating molecular dynamic simulation on the cell processor and Playstation 3.
Luttmann, Edgar; Ensign, Daniel L; Vaidyanathan, Vishal; Houston, Mike; Rimon, Noam; Øland, Jeppe; Jayachandran, Guha; Friedrichs, Mark; Pande, Vijay S
2009-01-30
Implementation of molecular dynamics (MD) calculations on novel architectures will vastly increase its power to calculate the physical properties of complex systems. Herein, we detail algorithmic advances developed to accelerate MD simulations on the Cell processor, a commodity processor found in PlayStation 3 (PS3). In particular, we discuss issues regarding memory access versus computation and the types of calculations which are best suited for streaming processors such as the Cell, focusing on implicit solvation models. We conclude with a comparison of improved performance on the PS3's Cell processor over more traditional processors. (c) 2008 Wiley Periodicals, Inc.
Fast and Scalable Computation of the Forward and Inverse Discrete Periodic Radon Transform.
Carranza, Cesar; Llamocca, Daniel; Pattichis, Marios
2016-01-01
The discrete periodic radon transform (DPRT) has extensively been used in applications that involve image reconstructions from projections. Beyond classic applications, the DPRT can also be used to compute fast convolutions that avoids the use of floating-point arithmetic associated with the use of the fast Fourier transform. Unfortunately, the use of the DPRT has been limited by the need to compute a large number of additions and the need for a large number of memory accesses. This paper introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on the use of: a parallel array of fixed-point adder trees; circular shift registers to remove the need for accessing external memory components when selecting the input data for the adder trees; an image block-based approach to DPRT computation that can fit the proposed architecture to available resources; and fast transpositions that are computed in one or a few clock cycles that do not depend on the size of the input image. As a result, for an N × N image (N prime), the proposed approach can compute up to N(2) additions per clock cycle. Compared with the previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources. For example, for a 251×251 image, for approximately 25% fewer flip-flops than required for a systolic implementation, we have that the scalable DPRT is computed 36 times faster. For the fastest case, we introduce optimized just 2N + ⌈log(2) N⌉ + 1 and 2N + 3 ⌈log(2) N⌉ + B + 2 cycles, architectures that can compute the DPRT and its inverse in respectively, where B is the number of bits used to represent each input pixel. On the other hand, the scalable DPRT approach requires more 1-b additions than for the systolic implementation and provides a tradeoff between speed and additional 1-b additions. All of the proposed DPRT architectures were implemented in VHSIC Hardware Description Language (VHDL) and validated using an Field-Programmable Gate Array (FPGA) implementation.
Pu, Y-F; Jiang, N.; Chang, W.; Yang, H-X; Li, C.; Duan, L-M
2017-01-01
To realize long-distance quantum communication and quantum network, it is required to have multiplexed quantum memory with many memory cells. Each memory cell needs to be individually addressable and independently accessible. Here we report an experiment that realizes a multiplexed DLCZ-type quantum memory with 225 individually accessible memory cells in a macroscopic atomic ensemble. As a key element for quantum repeaters, we demonstrate that entanglement with flying optical qubits can be stored into any neighboring memory cells and read out after a programmable time with high fidelity. Experimental realization of a multiplexed quantum memory with many individually accessible memory cells and programmable control of its addressing and readout makes an important step for its application in quantum information technology. PMID:28480891
Hamlet, Jason R [Albuquerque, NM; Robertson, Perry J [Albuquerque, NM; Pierson, Lyndon G [Albuquerque, NM; Olsberg, Ronald R [Albuquerque, NM
2012-02-28
A deflate decompressor includes at least one decompressor unit, a memory access controller, a feedback path, and an output buffer unit. The memory access controller is coupled to the decompressor unit via a data path and includes a data buffer to receive the data stream and temporarily buffer a first portion the data stream. The memory access controller transfers fixed length data units of the data stream from the data buffer to the decompressor unit with reference to a memory pointer pointing into the memory buffer. The feedback path couples the decompressor unit to the memory access controller to feed back decrement values to the memory access controller for updating the memory pointer. The decrement values each indicate a number of bits unused by the decompressor unit when decoding the fixed length data units. The output buffer unit buffers a second portion of the data stream after decompression.
Federal Register 2010, 2011, 2012, 2013, 2014
2010-07-28
... Random Access Memory Semiconductors and Products Containing Same, Including Memory Modules; Notice of a... importation of certain dynamic random access memory semiconductors and products containing same, including memory modules, by reason of infringement of certain claims of U.S. Patent Nos. 5,480,051; 5,422,309; 5...
MediLink: a wearable telemedicine system for emergency and mobile applications.
Koval, T; Dudziak, M
1999-01-01
The practical needs of the medical professional faced with critical care or emergency situations differ from those working in many environments where telemedicine and mobile computing have been introduced and tested. One constructive criticism of the telemedicine initiative has been to question what positive benefits are gained from videoconferencing, paperless transactions, and online access to patient record. With a goal of producing a positive answer to such questions an architecture for multipurpose mobile telemedicine applications has been developed. The core technology is based upon a wearable personal computer with a smart-card interface coupled with speech, pen, video input and wireless intranet connectivity. The TransPAC system with the MedLink software system is designed to provide an integrated solution for a broad range of health care functions where mobile and hands-free or limited-access systems are preferred or necessary and where the capabilities of other mobile devices are insufficient or inappropriate. Structured and noise-resistant speech-to-text interfacing plus the use of a web browser-like display, accessible through either a flatpanel, standard, or headset monitor, gives the beltpack TransPAC computer the functions of a complete desktop including PCMCIA card interfaces for internet connectivity and a secure smartcard with 16-bit microprocessor and upwards of 64K memory. The card acts to provide user access control for security, user custom configuration of applications and display and vocabulary, and memory to diminish the need for PC-server communications while in an active session. TransPAC is being implemented for EMT and ER staff usage.
Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation
Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.; ...
2017-01-03
A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC),Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” andmore » if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies near-memory and DRAM to be similar. Here, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient.« less
Overview of emerging nonvolatile memory technologies
2014-01-01
Nonvolatile memory technologies in Si-based electronics date back to the 1990s. Ferroelectric field-effect transistor (FeFET) was one of the most promising devices replacing the conventional Flash memory facing physical scaling limitations at those times. A variant of charge storage memory referred to as Flash memory is widely used in consumer electronic products such as cell phones and music players while NAND Flash-based solid-state disks (SSDs) are increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and even data centers. The integration limit of Flash memories is approaching, and many new types of memory to replace conventional Flash memories have been proposed. Emerging memory technologies promise new memories to store more data at less cost than the expensive-to-build silicon chips used by popular consumer gadgets including digital cameras, cell phones and portable music players. They are being investigated and lead to the future as potential alternatives to existing memories in future computing systems. Emerging nonvolatile memory technologies such as magnetic random-access memory (MRAM), spin-transfer torque random-access memory (STT-RAM), ferroelectric random-access memory (FeRAM), phase-change memory (PCM), and resistive random-access memory (RRAM) combine the speed of static random-access memory (SRAM), the density of dynamic random-access memory (DRAM), and the nonvolatility of Flash memory and so become very attractive as another possibility for future memory hierarchies. Many other new classes of emerging memory technologies such as transparent and plastic, three-dimensional (3-D), and quantum dot memory technologies have also gained tremendous popularity in recent years. Subsequently, not an exaggeration to say that computer memory could soon earn the ultimate commercial validation for commercial scale-up and production the cheap plastic knockoff. Therefore, this review is devoted to the rapidly developing new class of memory technologies and scaling of scientific procedures based on an investigation of recent progress in advanced Flash memory devices. PMID:25278820
Overview of emerging nonvolatile memory technologies.
Meena, Jagan Singh; Sze, Simon Min; Chand, Umesh; Tseng, Tseung-Yuen
2014-01-01
Nonvolatile memory technologies in Si-based electronics date back to the 1990s. Ferroelectric field-effect transistor (FeFET) was one of the most promising devices replacing the conventional Flash memory facing physical scaling limitations at those times. A variant of charge storage memory referred to as Flash memory is widely used in consumer electronic products such as cell phones and music players while NAND Flash-based solid-state disks (SSDs) are increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and even data centers. The integration limit of Flash memories is approaching, and many new types of memory to replace conventional Flash memories have been proposed. Emerging memory technologies promise new memories to store more data at less cost than the expensive-to-build silicon chips used by popular consumer gadgets including digital cameras, cell phones and portable music players. They are being investigated and lead to the future as potential alternatives to existing memories in future computing systems. Emerging nonvolatile memory technologies such as magnetic random-access memory (MRAM), spin-transfer torque random-access memory (STT-RAM), ferroelectric random-access memory (FeRAM), phase-change memory (PCM), and resistive random-access memory (RRAM) combine the speed of static random-access memory (SRAM), the density of dynamic random-access memory (DRAM), and the nonvolatility of Flash memory and so become very attractive as another possibility for future memory hierarchies. Many other new classes of emerging memory technologies such as transparent and plastic, three-dimensional (3-D), and quantum dot memory technologies have also gained tremendous popularity in recent years. Subsequently, not an exaggeration to say that computer memory could soon earn the ultimate commercial validation for commercial scale-up and production the cheap plastic knockoff. Therefore, this review is devoted to the rapidly developing new class of memory technologies and scaling of scientific procedures based on an investigation of recent progress in advanced Flash memory devices.
Computers in Academic Architecture Libraries.
ERIC Educational Resources Information Center
Willis, Alfred; And Others
1992-01-01
Computers are widely used in architectural research and teaching in U.S. schools of architecture. A survey of libraries serving these schools sought information on the emphasis placed on computers by the architectural curriculum, accessibility of computers to library staff, and accessibility of computers to library patrons. Survey results and…
NASA Technical Reports Server (NTRS)
Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas
2008-01-01
A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
The Climate Data Analytic Services (CDAS) Framework.
NASA Astrophysics Data System (ADS)
Maxwell, T. P.; Duffy, D.
2016-12-01
Faced with unprecedented growth in climate data volume and demand, NASA has developed the Climate Data Analytic Services (CDAS) framework. This framework enables scientists to execute data processing workflows combining common analysis operations in a high performance environment close to the massive data stores at NASA. The data is accessed in standard (NetCDF, HDF, etc.) formats in a POSIX file system and processed using vetted climate data analysis tools (ESMF, CDAT, NCO, etc.). A dynamic caching architecture enables interactive response times. CDAS utilizes Apache Spark for parallelization and a custom array framework for processing huge datasets within limited memory spaces. CDAS services are accessed via a WPS API being developed in collaboration with the ESGF Compute Working Team to support server-side analytics for ESGF. The API can be accessed using either direct web service calls, a python script, a unix-like shell client, or a javascript-based web application. Client packages in python, scala, or javascript contain everything needed to make CDAS requests. The CDAS architecture brings together the tools, data storage, and high-performance computing required for timely analysis of large-scale data sets, where the data resides, to ultimately produce societal benefits. It is is currently deployed at NASA in support of the Collaborative REAnalysis Technical Environment (CREATE) project, which centralizes numerous global reanalysis datasets onto a single advanced data analytics platform. This service permits decision makers to investigate climate changes around the globe, inspect model trends and variability, and compare multiple reanalysis datasets.
The Effects of Architecture and Process on the Hardness of Programmable Technologies
NASA Technical Reports Server (NTRS)
Katz, Richard; Wang, J. J.; Reed, R.; Kleyner, I.; DOrdine, M.; McCollum, J,; Cronquist, B.; Howard, J.
1999-01-01
Architecture and process, combined, significantly affect the hardness of programmable technologies. The effects of high energy ions, ferroelectric memory architectures, and shallow trench isolation are investigated. A detailed single event latchup (SEL) study has been performed.
Roofline model toolkit: A practical tool for architectural and program analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lo, Yu Jung; Williams, Samuel; Van Straalen, Brian
We present preliminary results of the Roofline Toolkit for multicore, many core, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented micro benchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measuremore » sustained PCIe throughput with four GPU memory managed mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.« less
A Neural Network Architecture For Rapid Model Indexing In Computer Vision Systems
NASA Astrophysics Data System (ADS)
Pawlicki, Ted
1988-03-01
Models of objects stored in memory have been shown to be useful for guiding the processing of computer vision systems. A major consideration in such systems, however, is how stored models are initially accessed and indexed by the system. As the number of stored models increases, the time required to search memory for the correct model becomes high. Parallel distributed, connectionist, neural networks' have been shown to have appealing content addressable memory properties. This paper discusses an architecture for efficient storage and reference of model memories stored as stable patterns of activity in a parallel, distributed, connectionist, neural network. The emergent properties of content addressability and resistance to noise are exploited to perform indexing of the appropriate object centered model from image centered primitives. The system consists of three network modules each of which represent information relative to a different frame of reference. The model memory network is a large state space vector where fields in the vector correspond to ordered component objects and relative, object based spatial relationships between the component objects. The component assertion network represents evidence about the existence of object primitives in the input image. It establishes local frames of reference for object primitives relative to the image based frame of reference. The spatial relationship constraint network is an intermediate representation which enables the association between the object based and the image based frames of reference. This intermediate level represents information about possible object orderings and establishes relative spatial relationships from the image based information in the component assertion network below. It is also constrained by the lawful object orderings in the model memory network above. The system design is consistent with current psychological theories of recognition by component. It also seems to support Marr's notions of hierarchical indexing. (i.e. the specificity, adjunct, and parent indices) It supports the notion that multiple canonical views of an object may have to be stored in memory to enable its efficient identification. The use of variable fields in the state space vectors appears to keep the number of required nodes in the network down to a tractable number while imposing a semantic value on different areas of the state space. This semantic imposition supports an interface between the analogical aspects of neural networks and the propositional paradigms of symbolic processing.
The Mind and Brain of Short-Term Memory
Jonides, John; Lewis, Richard L.; Nee, Derek Evan; Lustig, Cindy A.; Berman, Marc G.; Moore, Katherine Sledge
2014-01-01
The past 10 years have brought near-revolutionary changes in psychological theories about short-term memory, with similarly great advances in the neurosciences. Here, we critically examine the major psychological theories (the “mind”) of short-term memory and how they relate to evidence about underlying brain mechanisms. We focus on three features that must be addressed by any satisfactory theory of short-term memory. First, we examine the evidence for the architecture of short-term memory, with special attention to questions of capacity and how—or whether—short-term memory can be separated from long-term memory. Second, we ask how the components of that architecture enact processes of encoding, maintenance, and retrieval. Third, we describe the debate over the reason about forgetting from short-term memory, whether interference or decay is the cause. We close with a conceptual model tracing the representation of a single item through a short-term memory task, describing the biological mechanisms that might support psychological processes on a moment-by-moment basis as an item is encoded, maintained over a delay with some forgetting, and ultimately retrieved. PMID:17854286
Cooperative Data Sharing: Simple Support for Clusters of SMP Nodes
NASA Technical Reports Server (NTRS)
DiNucci, David C.; Balley, David H. (Technical Monitor)
1997-01-01
Libraries like PVM and MPI send typed messages to allow for heterogeneous cluster computing. Lower-level libraries, such as GAM, provide more efficient access to communication by removing the need to copy messages between the interface and user space in some cases. still lower-level interfaces, such as UNET, get right down to the hardware level to provide maximum performance. However, these are all still interfaces for passing messages from one process to another, and have limited utility in a shared-memory environment, due primarily to the fact that message passing is just another term for copying. This drawback is made more pertinent by today's hybrid architectures (e.g. clusters of SMPs), where it is difficult to know beforehand whether two communicating processes will share memory. As a result, even portable language tools (like HPF compilers) must either map all interprocess communication, into message passing with the accompanying performance degradation in shared memory environments, or they must check each communication at run-time and implement the shared-memory case separately for efficiency. Cooperative Data Sharing (CDS) is a single user-level API which abstracts all communication between processes into the sharing and access coordination of memory regions, in a model which might be described as "distributed shared messages" or "large-grain distributed shared memory". As a result, the user programs to a simple latency-tolerant abstract communication specification which can be mapped efficiently to either a shared-memory or message-passing based run-time system, depending upon the available architecture. Unlike some distributed shared memory interfaces, the user still has complete control over the assignment of data to processors, the forwarding of data to its next likely destination, and the queuing of data until it is needed, so even the relatively high latency present in clusters can be accomodated. CDS does not require special use of an MMU, which can add overhead to some DSM systems, and does not require an SPMD programming model. unlike some message-passing interfaces, CDS allows the user to implement efficient demand-driven applications where processes must "fight" over data, and does not perform copying if processes share memory and do not attempt concurrent writes. CDS also supports heterogeneous computing, dynamic process creation, handlers, and a very simple thread-arbitration mechanism. Additional support for array subsections is currently being considered. The CDS1 API, which forms the kernel of CDS, is built primarily upon only 2 communication primitives, one process initiation primitive, and some data translation (and marshalling) routines, memory allocation routines, and priority control routines. The entire current collection of 28 routines provides enough functionality to implement most (or all) of MPI 1 and 2, which has a much larger interface consisting of hundreds of routines. still, the API is small enough to consider integrating into standard os interfaces for handling inter-process communication in a network-independent way. This approach would also help to solve many of the problems plaguing other higher-level standards such as MPI and PVM which must, in some cases, "play OS" to adequately address progress and process control issues. The CDS2 API, a higher level of interface roughly equivalent in functionality to MPI and to be built entirely upon CDS1, is still being designed. It is intended to add support for the equivalent of communicators, reduction and other collective operations, process topologies, additional support for process creation, and some automatic memory management. CDS2 will not exactly match MPI, because the copy-free semantics of communication from CDS1 will be supported. CDS2 application programs will be free to carefully also use CDS1. CDS1 has been implemented on networks of workstations running unmodified Unix-based operating systems, using UDP/IP and vendor-supplied high- performance locks. Although its inter-node performance is currently unimpressive due to rudimentary implementation technique, it even now outperforms highly-optimized MPI implementation on intra-node communication due to its support for non-copy communication. The similarity of the CDS1 architecture to that of other projects such as UNET and TRAP suggests that the inter-node performance can be increased significantly to surpass MPI or PVM, and it may be possible to migrate some of its functionality to communication controllers.
Working Memory and Reasoning: The Processing Loads Imposed by Analogies.
ERIC Educational Resources Information Center
Halford, Graeme S.
The proposals concerning working memory outlined in this paper involve the architecture of working memory, the reasoning mechanisms that draw on it, and the ways in which working memory may develop with age. Ways of assessing task demands and children's working memory capacities are also considered. It is noted that there is long-standing evidence…
NASA Astrophysics Data System (ADS)
Huffman, James Douglas
2001-11-01
The most important issue facing the future business success of the Digital Micromirror Device or DMD™ produced by Texas Instruments is the cost of the actual device. As the business and consumer markets call for higher resolution displays, the array size will have to be increased to incorporate more pixels. The manufacturing costs associated with building these higher resolution displays follow an exponential relation with the number of pixels due to yield loss and reduced number of chips per silicon wafer. Each pixel is actuated by electrostatics that are provided by a memory cell that is built in the underlying silicon substrate. One way to decrease cost of the wafer is to change the memory cell architecture from a static random access configuration or SRAM to a dynamic random access configuration or DRAM. This change has the benefits of having fewer components per area and a lower metal density. This reduction in the component count and metal density has a dramatic effect on the yield of the memory array by reducing the particle sensitivity of the underlying cell. The main drawback to using a DRAM configuration in a display application is the light sensitivity of a charge storage device built in the silicon substrate. As the photons pass through the mechanical micromirrors and illuminate the DRAM cell, the effective electrostatic potential of the memory element used for the mirror actuation is reduced. This dissertation outlines the issues associated with the light sensitivity of a DRAM memory cell as the actuation element for a micromirror. The concept of charge depletion on a silicon capacitor due to recombination of photogenerated carriers is explored and experimentally verified. The effects of the reduced potential on the capacitor on the micromirror are also explored. Optical modeling is used to determine the incoming photon flux to determine the benefits of adding a charge recombination region as part of the DRAM memory cell. Several options are explored to reduce the effect of the incoming photons on the potential of the memory cell. The results will show that a 1T1C memory cell with N-type recombination regions and maximum light shielding is sufficient for a projector application.
Parameterized Micro-benchmarking: An Auto-tuning Approach for Complex Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ma, Wenjing; Krishnamoorthy, Sriram; Agrawal, Gagan
2012-05-15
Auto-tuning has emerged as an important practical method for creating highly optimized implementations of key computational kernels and applications. However, the growing complexity of architectures and applications is creating new challenges for auto-tuning. Complex applications can involve a prohibitively large search space that precludes empirical auto-tuning. Similarly, architectures are becoming increasingly complicated, making it hard to model performance. In this paper, we focus on the challenge to auto-tuning presented by applications with a large number of kernels and kernel instantiations. While these kernels may share a somewhat similar pattern, they differ considerably in problem sizes and the exact computation performed.more » We propose and evaluate a new approach to auto-tuning which we refer to as parameterized micro-benchmarking. It is an alternative to the two existing classes of approaches to auto-tuning: analytical model-based and empirical search-based. Particularly, we argue that the former may not be able to capture all the architectural features that impact performance, whereas the latter might be too expensive for an application that has several different kernels. In our approach, different expressions in the application, different possible implementations of each expression, and the key architectural features, are used to derive a simple micro-benchmark and a small parameter space. This allows us to learn the most significant features of the architecture that can impact the choice of implementation for each kernel. We have evaluated our approach in the context of GPU implementations of tensor contraction expressions encountered in excited state calculations in quantum chemistry. We have focused on two aspects of GPUs that affect tensor contraction execution: memory access patterns and kernel consolidation. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2 over the version that used default optimizations, but no auto-tuning. We demonstrate that observations made from microbenchmarks match the behavior seen from real expressions. In the process, we make important observations about the memory hierarchy of two of the most recent NVIDIA GPUs, which can be used in other optimization frameworks as well.« less
Architecture for Multiple Interacting Robot Intelligences
NASA Technical Reports Server (NTRS)
Peters, Richard Alan, II (Inventor)
2008-01-01
An architecture for robot intelligence enables a robot to learn new behaviors and create new behavior sequences autonomously and interact with a dynamically changing environment. Sensory information is mapped onto a Sensory Ego-Sphere (SES) that rapidly identifies important changes in the environment and functions much like short term memory. Behaviors are stored in a database associative memory (DBAM) that creates an active map from the robot's current state to a goal state and functions much like long term memory. A dream state converts recent activities stored in the SES and creates or modifies behaviors in the DBAM.
NASA Astrophysics Data System (ADS)
Yamamoto, Shuu'ichirou; Shuto, Yusuke; Sugahara, Satoshi
2013-07-01
We computationally analyzed performance and power-gating (PG) ability of a new nonvolatile delay flip-flop (NV-DFF) based on pseudo-spin-MOSFET (PS-MOSFET) architecture using spin-transfer-torque magnetic tunnel junctions (STT-MTJs). The high-performance energy-efficient PG operations of the NV-DFF can be achieved owing to its cell structure employing PS-MOSFETs that can electrically separate the STT-MTJs from the ordinary DFF part of the NV-DFF. This separation also makes it possible that the break-even time (BET) of the NV-DFF is designed by the size of the PS-MOSFETs without performance degradation of the normal DFF operations. The effect of the area occupation ratio of the NV-DFFs to a CMOS logic system on the BET was also analyzed. Although the optimized BET was varied depending on the area occupation ratio, energy-efficient fine-grained PG with a BET of several sub-microseconds was revealed to be achieved. We also proposed microprocessors and system-on-chip (SoC) devices using nonvolatile hierarchical-memory systems wherein NV-DFF and nonvolatile static random access memory (NV-SRAM) circuits are used as fundamental building blocks. Contribution to the Topical Issue “International Semiconductor Conference Dresden-Grenoble - ISCDG 2012”, Edited by Gérard Ghibaudo, Francis Balestra and Simon Deleonibus.
Parallel, Distributed Scripting with Python
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, P J
2002-05-24
Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI librarymore » gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc ... These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider the a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.« less
VIRTUAL FRAME BUFFER INTERFACE
NASA Technical Reports Server (NTRS)
Wolfe, T. L.
1994-01-01
Large image processing systems use multiple frame buffers with differing architectures and vendor supplied user interfaces. This variety of architectures and interfaces creates software development, maintenance, and portability problems for application programs. The Virtual Frame Buffer Interface program makes all frame buffers appear as a generic frame buffer with a specified set of characteristics, allowing programmers to write code which will run unmodified on all supported hardware. The Virtual Frame Buffer Interface converts generic commands to actual device commands. The virtual frame buffer consists of a definition of capabilities and FORTRAN subroutines that are called by application programs. The virtual frame buffer routines may be treated as subroutines, logical functions, or integer functions by the application program. Routines are included that allocate and manage hardware resources such as frame buffers, monitors, video switches, trackballs, tablets and joysticks; access image memory planes; and perform alphanumeric font or text generation. The subroutines for the various "real" frame buffers are in separate VAX/VMS shared libraries allowing modification, correction or enhancement of the virtual interface without affecting application programs. The Virtual Frame Buffer Interface program was developed in FORTRAN 77 for a DEC VAX 11/780 or a DEC VAX 11/750 under VMS 4.X. It supports ADAGE IK3000, DEANZA IP8500, Low Resolution RAMTEK 9460, and High Resolution RAMTEK 9460 Frame Buffers. It has a central memory requirement of approximately 150K. This program was developed in 1985.
Mozaffari, Brian
2014-01-01
Based on the notion that the brain is equipped with a hierarchical organization, which embodies environmental contingencies across many time scales, this paper suggests that the medial temporal lobe (MTL)-located deep in the hierarchy-serves as a bridge connecting supra- to infra-MTL levels. Bridging the upper and lower regions of the hierarchy provides a parallel architecture that optimizes information flow between upper and lower regions to aid attention, encoding, and processing of quick complex visual phenomenon. Bypassing intermediate hierarchy levels, information conveyed through the MTL "bridge" allows upper levels to make educated predictions about the prevailing context and accordingly select lower representations to increase the efficiency of predictive coding throughout the hierarchy. This selection or activation/deactivation is associated with endogenous attention. In the event that these "bridge" predictions are inaccurate, this architecture enables the rapid encoding of novel contingencies. A review of hierarchical models in relation to memory is provided along with a new theory, Medial-temporal-lobe Conduit for Parallel Connectivity (MCPC). In this scheme, consolidation is considered as a secondary process, occurring after a MTL-bridged connection, which eventually allows upper and lower levels to access each other directly. With repeated reactivations, as contingencies become consolidated, less MTL activity is predicted. Finally, MTL bridging may aid processing transient but structured perceptual events, by allowing communication between upper and lower levels without calling on intermediate levels of representation.
Lee, Ke-Jing; Chang, Yu-Chi; Lee, Cheng-Jung; Wang, Li-Wen; Wang, Yeong-Her
2017-01-01
A one-transistor and one-resistor (1T1R) architecture with a resistive random access memory (RRAM) cell connected to an organic thin-film transistor (OTFT) device is successfully demonstrated to avoid the cross-talk issues of only one RRAM cell. The OTFT device, which uses barium zirconate nickelate (BZN) as a dielectric layer, exhibits favorable electrical properties, such as a high field-effect mobility of 2.5 cm2/Vs, low threshold voltage of −2.8 V, and low leakage current of 10−12 A, for a driver in the 1T1R operation scheme. The 1T1R architecture with a TiO2-based RRAM cell connected with a BZN OTFT device indicates a low operation current (10 μA) and reliable data retention (over ten years). This favorable performance of the 1T1R device can be attributed to the additional barrier heights introduced by using Ni (II) acetylacetone as a substitute for acetylacetone, and the relatively low leakage current of a BZN dielectric layer. The proposed 1T1R device with low leakage current OTFT and excellent uniform resistance distribution of RRAM exhibits a good potential for use in practical low-power electronic applications. PMID:29232828
Controllable Organic Resistive Switching Achieved by One-Step Integration of Cone-Shaped Contact.
Ling, Haifeng; Yi, Mingdong; Nagai, Masaru; Xie, Linghai; Wang, Laiyuan; Hu, Bo; Huang, Wei
2017-09-01
Conductive filaments (CFs)-based resistive random access memory possesses the ability of scaling down to sub-nanoscale with high-density integration architecture, making it the most promising nanoelectronic technology for reclaiming Moore's law. Compared with the extensive study in inorganic switching medium, the scientific challenge now is to understand the growth kinetics of nanoscale CFs in organic polymers, aiming to achieve controllable switching characteristics toward flexible and reliable nonvolatile organic memory. Here, this paper systematically investigates the resistive switching (RS) behaviors based on a widely adopted vertical architecture of Al/organic/indium-tin-oxide (ITO), with poly(9-vinylcarbazole) as the case study. A nanoscale Al filament with a dynamic-gap zone (DGZ) is directly observed using in situ scanning transmission electron microscopy (STEM) , which demonstrates that the RS behaviors are related to the random formation of spliced filaments consisting of Al and oxygen vacancy dual conductive channels growing through carbazole groups. The randomicity of the filament formation can be depressed by introducing a cone-shaped contact via a one-step integration method. The conical electrode can effectively shorten the DGZ and enhance the localized electric field, thus reducing the switching voltage and improving the RS uniformity. This study provides a deeper insight of the multiple filamentary mechanisms for organic RS effect. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
SERVER DEVELOPMENT FOR NSLS-II PHYSICS APPLICATIONS AND PERFORMANCE ANALYSIS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shen, G.; Kraimer, M.
2011-03-28
The beam commissioning software framework of NSLS-II project adopts a client/server based architecture to replace the more traditional monolithic high level application approach. The server software under development is available via an open source sourceforge project named epics-pvdata, which consists of modules pvData, pvAccess, pvIOC, and pvService. Examples of two services that already exist in the pvService module are itemFinder, and gather. Each service uses pvData to store in-memory transient data, pvService to transfer data over the network, and pvIOC as the service engine. The performance benchmarking for pvAccess and both gather service and item finder service are presented inmore » this paper. The performance comparison between pvAccess and Channel Access are presented also. For an ultra low emittance synchrotron radiation light source like NSLS II, the control system requirements, especially for beam control are tight. To control and manipulate the beam effectively, a use case study has been performed to satisfy the requirement and theoretical evaluation has been performed. The analysis shows that model based control is indispensable for beam commissioning and routine operation. However, there are many challenges such as how to re-use a design model for on-line model based control, and how to combine the numerical methods for modeling of a realistic lattice with the analytical techniques for analysis of its properties. To satisfy the requirements and challenges, adequate system architecture for the software framework for beam commissioning and operation is critical. The existing traditional approaches are self-consistent, and monolithic. Some of them have adopted a concept of middle layer to separate low level hardware processing from numerical algorithm computing, physics modelling, data manipulating and plotting, and error handling. However, none of the existing approaches can satisfy the requirement. A new design has been proposed by introducing service oriented architecture technology, and client interface is undergoing. The design and implementation adopted a new EPICS implementation, namely epics-pvdata [9], which is under active development. The implementation of this project under Java is close to stable, and binding to other language such as C++ and/or Python is undergoing. In this paper, we focus on the performance benchmarking and comparison for pvAccess and Channel Access, the performance evaluation for 2 services, gather and item finder respectively.« less
Scheduling for Locality in Shared-Memory Multiprocessors
1993-05-01
Submitted in Partial Fulfillment of the Requirements for the Degree ’)iIC Q(JALfryT INSPECTED 5 DOCTOR OF PHILOSOPHY I Accesion For Supervised by NTIS CRAM... architecture on parallel program performance, explain the implications of this trend on popular parallel programming models, and propose system software to 0...decomoosition and scheduling algorithms. I. SUIUECT TERMS IS. NUMBER OF PAGES shared-memory multiprocessors; architecture trends; loop 110 scheduling
A Low-Cost and Energy-Efficient Multiprocessor System-on-Chip for UWB MAC Layer
NASA Astrophysics Data System (ADS)
Xiao, Hao; Isshiki, Tsuyoshi; Khan, Arif Ullah; Li, Dongju; Kunieda, Hiroaki; Nakase, Yuko; Kimura, Sadahiro
Ultra-wideband (UWB) technology has attracted much attention recently due to its high data rate and low emission power. Its media access control (MAC) protocol, WiMedia MAC, promises a lot of facilities for high-speed and high-quality wireless communication. However, these benefits in turn involve a large amount of computational load, which challenges the traditional uniprocessor architecture based implementation method to provide the required performance. However, the constrained cost and power budget, on the other hand, makes using commercial multiprocessor solutions unrealistic. In this paper, a low-cost and energy-efficient multiprocessor system-on-chip (MPSoC), which tackles at once the aspects of system design, software migration and hardware architecture, is presented for the implementation of UWB MAC layer. Experimental results show that the proposed MPSoC, based on four simple RISC processors and shared-memory infrastructure, achieves up to 45% performance improvement and 65% power saving, but takes 15% less area than the uniprocessor implementation.
The design of an adaptive predictive coder using a single-chip digital signal processor
NASA Astrophysics Data System (ADS)
Randolph, M. A.
1985-01-01
A speech coding processor architecture design study has been performed in which Texas Instruments TMS32010 has been selected from among three commercially available digital signal processing integrated circuits and evaluated in an implementation study of real-time Adaptive Predictive Coding (APC). The TMS32010 has been compared with AR&T Bell Laboratories DSP I and Nippon Electric Co. PD7720 and was found to be most suitable for a single chip implementation of APC. A preliminary design system based on TMS32010 has been performed, and several of the hardware and software design issues are discussed. Particular attention was paid to the design of an external memory controller which permits rapid sequential access of external RAM. As a result, it has been determined that a compact hardware implementation of the APC algorithm is feasible based of the TSM32010. Originator-supplied keywords include: vocoders, speech compression, adaptive predictive coding, digital signal processing microcomputers, speech processor architectures, and special purpose processor.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wasserman, H.J.
1996-02-01
The second generation of the Digital Equipment Corp. (DEC) DECchip Alpha AXP microprocessor is referred to as the 21164. From the viewpoint of numerically-intensive computing, the primary difference between it and its predecessor, the 21064, is that the 21164 has twice the multiply/add throughput per clock period (CP), a maximum of two floating point operations (FLOPS) per CP vs. one for 21064. The AlphaServer 8400 is a shared-memory multiprocessor server system that can accommodate up to 12 CPUs and up to 14 GB of memory. In this report we will compare single processor performance of the 8400 system with thatmore » of the International Business Machines Corp. (IBM) RISC System/6000 POWER-2 microprocessor running at 66 MHz, the Silicon Graphics, Inc. (SGI) MIPS R8000 microprocessor running at 75 MHz, and the Cray Research, Inc. CRAY J90. The performance comparison is based on a set of Fortran benchmark codes that represent a portion of the Los Alamos National Laboratory supercomputer workload. The advantage of using these codes, is that the codes also span a wide range of computational characteristics, such as vectorizability, problem size, and memory access pattern. The primary disadvantage of using them is that detailed, quantitative analysis of performance behavior of all codes on all machines is difficult. One important addition to the benchmark set appears for the first time in this report. Whereas the older version was written for a vector processor, the newer version is more optimized for microprocessor architectures. Therefore, we have for the first time, an opportunity to measure performance on a single application using implementations that expose the respective strengths of vector and superscalar architecture. All results in this report are from single processors. A subsequent article will explore shared-memory multiprocessing performance of the 8400 system.« less
NASA Astrophysics Data System (ADS)
Noé, Pierre; Vallée, Christophe; Hippert, Françoise; Fillot, Frédéric; Raty, Jean-Yves
2018-01-01
Chalcogenide phase-change materials (PCMs), such as Ge-Sb-Te alloys, have shown outstanding properties, which has led to their successful use for a long time in optical memories (DVDs) and, recently, in non-volatile resistive memories. The latter, known as PCM memories or phase-change random access memories (PCRAMs), are the most promising candidates among emerging non-volatile memory (NVM) technologies to replace the current FLASH memories at CMOS technology nodes under 28 nm. Chalcogenide PCMs exhibit fast and reversible phase transformations between crystalline and amorphous states with very different transport and optical properties leading to a unique set of features for PCRAMs, such as fast programming, good cyclability, high scalability, multi-level storage capability, and good data retention. Nevertheless, PCM memory technology has to overcome several challenges to definitively invade the NVM market. In this review paper, we examine the main technological challenges that PCM memory technology must face and we illustrate how new memory architecture, innovative deposition methods, and PCM composition optimization can contribute to further improvements of this technology. In particular, we examine how to lower the programming currents and increase data retention. Scaling down PCM memories for large-scale integration means the incorporation of the PCM into more and more confined structures and raises materials science issues in order to understand interface and size effects on crystallization. Other materials science issues are related to the stability and ageing of the amorphous state of PCMs. The stability of the amorphous phase, which determines data retention in memory devices, can be increased by doping the PCM. Ageing of the amorphous phase leads to a large increase of the resistivity with time (resistance drift), which has up to now hindered the development of ultra-high multi-level storage devices. A review of the current understanding of all these issues is provided from a materials science point of view.
NASA Astrophysics Data System (ADS)
Irving, D. H.; Rasheed, M.; O'Doherty, N.
2010-12-01
The efficient storage, retrieval and interactive use of subsurface data present great challenges in geodata management. Data volumes are typically massive, complex and poorly indexed with inadequate metadata. Derived geomodels and interpretations are often tightly bound in application-centric and proprietary formats; open standards for long-term stewardship are poorly developed. Consequently current data storage is a combination of: complex Logical Data Models (LDMs) based on file storage formats; 2D GIS tree-based indexing of spatial data; and translations of serialised memory-based storage techniques into disk-based storage. Whilst adequate for working at the mesoscale over a short timeframes, these approaches all possess technical and operational shortcomings: data model complexity; anisotropy of access; scalability to large and complex datasets; and weak implementation and integration of metadata. High performance hardware such as parallelised storage and Relational Database Management System (RDBMS) have long been exploited in many solutions but the underlying data structure must provide commensurate efficiencies to allow multi-user, multi-application and near-realtime data interaction. We present an open Spatially-Registered Data Structure (SRDS) built on Massively Parallel Processing (MPP) database architecture implemented by a ANSI SQL 2008 compliant RDBMS. We propose a LDM comprising a 3D Earth model that is decomposed such that each increasing Level of Detail (LoD) is achieved by recursively halving the bin size until it is less than the error in each spatial dimension for that data point. The value of an attribute at that point is stored as a property of that point and at that LoD. It is key to the numerical efficiency of the SRDS that it is under-pinned by a power-of-two relationship thus precluding the need for computationally intensive floating point arithmetic. Our approach employed a tightly clustered MPP array with small clusters of storage, processors and memory communicating over a high-speed network inter-connect. This is a shared-nothing architecture where resources are managed within each cluster unlike most other RDBMSs. Data are accessed on this architecture by their primary index values which utilises the hashing algorithm for point-to-point access. The hashing algorithm’s main role is the efficient distribution of data across the clusters based on the primary index. In this study we used 3D seismic volumes, 2D seismic profiles and borehole logs to demonstrate application in both (x,y,TWT) and (x,y,z)-space. In the SRDS the primary index is a composite column index of (x,y) to avoid invoking time-consuming full table scans as is the case in tree-based systems. This means that data access is isotropic. A query for data in a specified spatial range permits retrieval recursively by point-to-point queries within each nested LoD yielding true linear performance up to the Petabyte scale with hardware scaling presenting the primary limiting factor. Our architecture and LDM promotes: realtime interaction with massive data volumes; streaming of result sets and server-rendered 2D/3D imagery; rigorous workflow control and auditing; and in-database algorithms run directly against data as a HPC cloud service.
Exterior view, westsouthwest, of Jeudevine Memorial Library. Built 18961897 and ...
Exterior view, west-southwest, of Jeudevine Memorial Library. Built 1896-1897 and designed by local architect Lambert Packard, the library is an excellent example of Richardsonian Romanesque architecture. - Jeudevine Memorial Library, 93 North Main Street, Hardwick, Caledonia County, VT
Sparse distributed memory overview
NASA Technical Reports Server (NTRS)
Raugh, Mike
1990-01-01
The Sparse Distributed Memory (SDM) project is investigating the theory and applications of massively parallel computing architecture, called sparse distributed memory, that will support the storage and retrieval of sensory and motor patterns characteristic of autonomous systems. The immediate objectives of the project are centered in studies of the memory itself and in the use of the memory to solve problems in speech, vision, and robotics. Investigation of methods for encoding sensory data is an important part of the research. Examples of NASA missions that may benefit from this work are Space Station, planetary rovers, and solar exploration. Sparse distributed memory offers promising technology for systems that must learn through experience and be capable of adapting to new circumstances, and for operating any large complex system requiring automatic monitoring and control. Sparse distributed memory is a massively parallel architecture motivated by efforts to understand how the human brain works. Sparse distributed memory is an associative memory, able to retrieve information from cues that only partially match patterns stored in the memory. It is able to store long temporal sequences derived from the behavior of a complex system, such as progressive records of the system's sensory data and correlated records of the system's motor controls.
Radiation-Hardened Solid-State Drive
NASA Technical Reports Server (NTRS)
Sheldon, Douglas J.
2010-01-01
A method is provided for a radiationhardened (rad-hard) solid-state drive for space mission memory applications by combining rad-hard and commercial off-the-shelf (COTS) non-volatile memories (NVMs) into a hybrid architecture. The architecture is controlled by a rad-hard ASIC (application specific integrated circuit) or a FPGA (field programmable gate array). Specific error handling and data management protocols are developed for use in a rad-hard environment. The rad-hard memories are smaller in overall memory density, but are used to control and manage radiation-induced errors in the main, and much larger density, non-rad-hard COTS memory devices. Small amounts of rad-hard memory are used as error buffers and temporary caches for radiation-induced errors in the large COTS memories. The rad-hard ASIC/FPGA implements a variety of error-handling protocols to manage these radiation-induced errors. The large COTS memory is triplicated for protection, and CRC-based counters are calculated for sub-areas in each COTS NVM array. These counters are stored in the rad-hard non-volatile memory. Through monitoring, rewriting, regeneration, triplication, and long-term storage, radiation-induced errors in the large NV memory are managed. The rad-hard ASIC/FPGA also interfaces with the external computer buses.
Method and system for training dynamic nonlinear adaptive filters which have embedded memory
NASA Technical Reports Server (NTRS)
Rabinowitz, Matthew (Inventor)
2002-01-01
Described herein is a method and system for training nonlinear adaptive filters (or neural networks) which have embedded memory. Such memory can arise in a multi-layer finite impulse response (FIR) architecture, or an infinite impulse response (IIR) architecture. We focus on filter architectures with separate linear dynamic components and static nonlinear components. Such filters can be structured so as to restrict their degrees of computational freedom based on a priori knowledge about the dynamic operation to be emulated. The method is detailed for an FIR architecture which consists of linear FIR filters together with nonlinear generalized single layer subnets. For the IIR case, we extend the methodology to a general nonlinear architecture which uses feedback. For these dynamic architectures, we describe how one can apply optimization techniques which make updates closer to the Newton direction than those of a steepest descent method, such as backpropagation. We detail a novel adaptive modified Gauss-Newton optimization technique, which uses an adaptive learning rate to determine both the magnitude and direction of update steps. For a wide range of adaptive filtering applications, the new training algorithm converges faster and to a smaller value of cost than both steepest-descent methods such as backpropagation-through-time, and standard quasi-Newton methods. We apply the algorithm to modeling the inverse of a nonlinear dynamic tracking system 5, as well as a nonlinear amplifier 6.
78 FR 74056 - Rail Vehicles Access Advisory Committee
Federal Register 2010, 2011, 2012, 2013, 2014
2013-12-10
...-0001] RIN 3014-AA42 Rail Vehicles Access Advisory Committee AGENCY: Architectural and Transportation... Architectural and Transportation Barriers Compliance Board (Access Board), established the Rail Vehicle Access... pursuant to the Americans with Disabilities Act for transportation vehicles that operate on fixed guideway...
Yoon, Doe Hyun; Muralimanohar, Naveen; Chang, Jichuan; Ranganthan, Parthasarathy
2017-09-26
A disclosed example method involves performing simultaneous data accesses on at least first and second independently selectable logical sub-ranks to access first data via a wide internal data bus in a memory device. The memory device includes a translation buffer chip, memory chips in independently selectable logical sub-ranks, a narrow external data bus to connect the translation buffer chip to a memory controller, and the wide internal data bus between the translation buffer chip and the memory chips. A data access is performed on only the first independently selectable logical sub-rank to access second data via the wide internal data bus. The example method also involves locating a first portion of the first data, a second portion of the first data, and the second data on the narrow external data bus during separate data transfers.
2014-09-30
Mental Domain = Ω Goal Management goal change goal input World =Ψ Memory Mission & Goals( ) World Model (-Ψ) Episodic Memory Semantic Memory ...Activations Trace Meta-Level Control Introspective Monitoring Memory Reasoning Trace ( ) Strategies Episodic Memory Metaknowledge Self Model...it is from incorrect or missing memory associations (i.e., indices). Similarly, correct information may exist in the input stream, but may not be
Three-dimensional integration of nanotechnologies for computing and data storage on a single chip
NASA Astrophysics Data System (ADS)
Shulaker, Max M.; Hills, Gage; Park, Rebecca S.; Howe, Roger T.; Saraswat, Krishna; Wong, H.-S. Philip; Mitra, Subhasish
2017-07-01
The computing demands of future data-intensive applications will greatly exceed the capabilities of current electronics, and are unlikely to be met by isolated improvements in transistors, data storage technologies or integrated circuit architectures alone. Instead, transformative nanosystems, which use new nanotechnologies to simultaneously realize improved devices and new integrated circuit architectures, are required. Here we present a prototype of such a transformative nanosystem. It consists of more than one million resistive random-access memory cells and more than two million carbon-nanotube field-effect transistors—promising new nanotechnologies for use in energy-efficient digital logic circuits and for dense data storage—fabricated on vertically stacked layers in a single chip. Unlike conventional integrated circuit architectures, the layered fabrication realizes a three-dimensional integrated circuit architecture with fine-grained and dense vertical connectivity between layers of computing, data storage, and input and output (in this instance, sensing). As a result, our nanosystem can capture massive amounts of data every second, store it directly on-chip, perform in situ processing of the captured data, and produce ‘highly processed’ information. As a working prototype, our nanosystem senses and classifies ambient gases. Furthermore, because the layers are fabricated on top of silicon logic circuitry, our nanosystem is compatible with existing infrastructure for silicon-based technologies. Such complex nano-electronic systems will be essential for future high-performance and highly energy-efficient electronic systems.
Three-dimensional integration of nanotechnologies for computing and data storage on a single chip.
Shulaker, Max M; Hills, Gage; Park, Rebecca S; Howe, Roger T; Saraswat, Krishna; Wong, H-S Philip; Mitra, Subhasish
2017-07-05
The computing demands of future data-intensive applications will greatly exceed the capabilities of current electronics, and are unlikely to be met by isolated improvements in transistors, data storage technologies or integrated circuit architectures alone. Instead, transformative nanosystems, which use new nanotechnologies to simultaneously realize improved devices and new integrated circuit architectures, are required. Here we present a prototype of such a transformative nanosystem. It consists of more than one million resistive random-access memory cells and more than two million carbon-nanotube field-effect transistors-promising new nanotechnologies for use in energy-efficient digital logic circuits and for dense data storage-fabricated on vertically stacked layers in a single chip. Unlike conventional integrated circuit architectures, the layered fabrication realizes a three-dimensional integrated circuit architecture with fine-grained and dense vertical connectivity between layers of computing, data storage, and input and output (in this instance, sensing). As a result, our nanosystem can capture massive amounts of data every second, store it directly on-chip, perform in situ processing of the captured data, and produce 'highly processed' information. As a working prototype, our nanosystem senses and classifies ambient gases. Furthermore, because the layers are fabricated on top of silicon logic circuitry, our nanosystem is compatible with existing infrastructure for silicon-based technologies. Such complex nano-electronic systems will be essential for future high-performance and highly energy-efficient electronic systems.
Multilevel Resistance Programming in Conductive Bridge Resistive Memory
NASA Astrophysics Data System (ADS)
Mahalanabis, Debayan
This work focuses on the existence of multiple resistance states in a type of emerging non-volatile resistive memory device known commonly as Programmable Metallization Cell (PMC) or Conductive Bridge Random Access Memory (CBRAM), which can be important for applications such as multi-bit memory as well as non-volatile logic and neuromorphic computing. First, experimental data from small signal, quasi-static and pulsed mode electrical characterization of such devices are presented which clearly demonstrate the inherent multi-level resistance programmability property in CBRAM devices. A physics based analytical CBRAM compact model is then presented which simulates the ion-transport dynamics and filamentary growth mechanism that causes resistance change in such devices. Simulation results from the model are fitted to experimental dynamic resistance switching characteristics. The model designed using Verilog-a language is computation-efficient and can be integrated with industry standard circuit simulation tools for design and analysis of hybrid circuits involving both CMOS and CBRAM devices. Three main circuit applications for CBRAM devices are explored in this work. Firstly, the susceptibility of CBRAM memory arrays to single event induced upsets is analyzed via compact model simulation and experimental heavy ion testing data that show possibility of both high resistance to low resistance and low resistance to high resistance transitions due to ion strikes. Next, a non-volatile sense amplifier based flip-flop architecture is proposed which can help make leakage power consumption negligible by allowing complete shutdown of power supply while retaining its output data in CBRAM devices. Reliability and energy consumption of the flip-flop circuit for different CBRAM low resistance levels and supply voltage values are analyzed and compared to CMOS designs. Possible extension of this architecture for threshold logic function computation using the CBRAM devices as re-configurable resistive weights is also discussed. Lastly, Spike timing dependent plasticity (STDP) based gradual resistance change behavior in CBRAM device fabricated in back-end-of-line on a CMOS die containing integrate and fire CMOS neuron circuits is demonstrated for the first time which indicates the feasibility of using CBRAM devices as electronic synapses in spiking neural network hardware implementations for non-Boolean neuromorphic computing.
Announcing Supercomputer Summit
Wells, Jack; Bland, Buddy; Nichols, Jeff; Hack, Jim; Foertter, Fernanda; Hagen, Gaute; Maier, Thomas; Ashfaq, Moetasim; Messer, Bronson; Parete-Koon, Suzanne
2018-01-16
Summit is the next leap in leadership-class computing systems for open science. With Summit we will be able to address, with greater complexity and higher fidelity, questions concerning who we are, our place on earth, and in our universe. Summit will deliver more than five times the computational performance of Titanâs 18,688 nodes, using only approximately 3,400 nodes when it arrives in 2017. Like Titan, Summit will have a hybrid architecture, and each node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs all connected together with NVIDIAâs high-speed NVLink. Each node will have over half a terabyte of coherent memory (high bandwidth memory + DDR4) addressable by all CPUs and GPUs plus 800GB of non-volatile RAM that can be used as a burst buffer or as extended memory. To provide a high rate of I/O throughput, the nodes will be connected in a non-blocking fat-tree using a dual-rail Mellanox EDR InfiniBand interconnect. Upon completion, Summit will allow researchers in all fields of science unprecedented access to solving some of the worldâs most pressing challenges.
Physical Realization of a Supervised Learning System Built with Organic Memristive Synapses
NASA Astrophysics Data System (ADS)
Lin, Yu-Pu; Bennett, Christopher H.; Cabaret, Théo; Vodenicarevic, Damir; Chabi, Djaafar; Querlioz, Damien; Jousselme, Bruno; Derycke, Vincent; Klein, Jacques-Olivier
2016-09-01
Multiple modern applications of electronics call for inexpensive chips that can perform complex operations on natural data with limited energy. A vision for accomplishing this is implementing hardware neural networks, which fuse computation and memory, with low cost organic electronics. A challenge, however, is the implementation of synapses (analog memories) composed of such materials. In this work, we introduce robust, fastly programmable, nonvolatile organic memristive nanodevices based on electrografted redox complexes that implement synapses thanks to a wide range of accessible intermediate conductivity states. We demonstrate experimentally an elementary neural network, capable of learning functions, which combines four pairs of organic memristors as synapses and conventional electronics as neurons. Our architecture is highly resilient to issues caused by imperfect devices. It tolerates inter-device variability and an adaptable learning rule offers immunity against asymmetries in device switching. Highly compliant with conventional fabrication processes, the system can be extended to larger computing systems capable of complex cognitive tasks, as demonstrated in complementary simulations.
A Single Chip VLSI Implementation of a QPSK/SQPSK Demodulator for a VSAT Receiver Station
NASA Technical Reports Server (NTRS)
Kwatra, S. C.; King, Brent
1995-01-01
This thesis presents a VLSI implementation of a QPSK/SQPSK demodulator. It is designed to be employed in a VSAT earth station that utilizes the FDMA/TDM link. A single chip architecture is used to enable this chip to be easily employed in the VSAT system. This demodulator contains lowpass filters, integrate and dump units, unique word detectors, a timing recovery unit, a phase recovery unit and a down conversion unit. The design stages start with a functional representation of the system by using the C programming language. Then it progresses into a register based representation using the VHDL language. The layout components are designed based on these VHDL models and simulated. Component generators are developed for the adder, multiplier, read-only memory and serial access memory in order to shorten the design time. These sub-components are then block routed to form the main components of the system. The main components are block routed to form the final demodulator.
Reilly, Jamie; Garcia, Amanda; Binney, Richard J.
2016-01-01
Much remains to be learned about the neural architecture underlying word meaning. Fully distributed models of semantic memory predict that the sound of a barking dog will conjointly engage a network of distributed sensorimotor spokes. An alternative framework holds that modality-specific features additionally converge within transmodal hubs. Participants underwent functional MRI while covertly naming familiar objects versus newly learned novel objects from only one of their constituent semantic features (visual form, characteristic sound, or point-light motion representation). Relative to the novel object baseline, familiar concepts elicited greater activation within association regions specific to that presentation modality. Furthermore, visual form elicited activation within high-level auditory association cortex. Conversely, environmental sounds elicited activation in regions proximal to visual association cortex. Both conditions commonly engaged a putative hub region within lateral anterior temporal cortex. These results support hybrid semantic models in which local hubs and distributed spokes are dually engaged in service of semantic memory. PMID:27289210
Physical Realization of a Supervised Learning System Built with Organic Memristive Synapses.
Lin, Yu-Pu; Bennett, Christopher H; Cabaret, Théo; Vodenicarevic, Damir; Chabi, Djaafar; Querlioz, Damien; Jousselme, Bruno; Derycke, Vincent; Klein, Jacques-Olivier
2016-09-07
Multiple modern applications of electronics call for inexpensive chips that can perform complex operations on natural data with limited energy. A vision for accomplishing this is implementing hardware neural networks, which fuse computation and memory, with low cost organic electronics. A challenge, however, is the implementation of synapses (analog memories) composed of such materials. In this work, we introduce robust, fastly programmable, nonvolatile organic memristive nanodevices based on electrografted redox complexes that implement synapses thanks to a wide range of accessible intermediate conductivity states. We demonstrate experimentally an elementary neural network, capable of learning functions, which combines four pairs of organic memristors as synapses and conventional electronics as neurons. Our architecture is highly resilient to issues caused by imperfect devices. It tolerates inter-device variability and an adaptable learning rule offers immunity against asymmetries in device switching. Highly compliant with conventional fabrication processes, the system can be extended to larger computing systems capable of complex cognitive tasks, as demonstrated in complementary simulations.
NASA Astrophysics Data System (ADS)
Laracuente, Nicholas; Grossman, Carl
2013-03-01
We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general purpose computing with inexpensive GPU hardware. These devices are more suited for emulating hardware autocorrelators than traditional CPU-based software applications by emphasizing parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics pci card in a 3.2 GHz Intel i5 CPU based computer running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College
NASA Astrophysics Data System (ADS)
Liu, Tianyu; Du, Xining; Ji, Wei; Xu, X. George; Brown, Forrest B.
2014-06-01
For nuclear reactor analysis such as the neutron eigenvalue calculations, the time consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA), and tested on a NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large amount of global memory transactions. Possible solutions to improve the code efficiency are discussed.
System architecture of a gallium arsenide one-gigahertz digital IC tester
NASA Technical Reports Server (NTRS)
Fouts, Douglas J.; Johnson, John M.; Butner, Steven E.; Long, Stephen I.
1987-01-01
The design for a 1-GHz digital integrated circuit tester for the evaluation of custom GaAs chips and subsystems is discussed. Technology-related problems affecting the design of a GaAs computer are discussed, with emphasis on the problems introduced by long printed-circuit-board interconnect. High-speed interface modules provide a link between the low-speed microprocessor and the chip under test. Memory-multiplexer and memory-shift register architectures for the storage of test vectors are described in addition to an architecture for local data storage consisting of a long chain of GaAs shift registers. The tester is constructed around a VME system card cage and backplane, and very little high-speed interconnect exists between boards. The tester has a three part self-test consisting of a CPU board confidence test, a main memory confidence test, and a high-speed interface module functional test.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1996-01-01
We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1997-01-01
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
NASA Astrophysics Data System (ADS)
Ho, Wan Ching; Dautenhahn, Kerstin; Nehaniv, Chrystopher
2008-03-01
In this paper, we discuss the concept of autobiographic agent and how memory may extend an agent's temporal horizon and increase its adaptability. These concepts are applied to an implementation of a scenario where agents are interacting in a complex virtual artificial life environment. We present computational memory architectures for autobiographic virtual agents that enable agents to retrieve meaningful information from their dynamic memories which increases their adaptation and survival in the environment. The design of the memory architectures, the agents, and the virtual environment are described in detail. Next, a series of experimental studies and their results are presented which show the adaptive advantage of autobiographic memory, i.e. from remembering significant experiences. Also, in a multi-agent scenario where agents can communicate via stories based on their autobiographic memory, it is found that new adaptive behaviours can emerge from an individual's reinterpretation of experiences received from other agents whereby higher communication frequency yields better group performance. An interface is described that visualises the memory contents of an agent. From an observer perspective, the agents' behaviours can be understood as individually structured, and temporally grounded, and, with the communication of experience, can be seen to rely on emergent mixed narrative reconstructions combining the experiences of several agents. This research leads to insights into how bottom-up story-telling and autobiographic reconstruction in autonomous, adaptive agents allow temporally grounded behaviour to emerge. The article concludes with a discussion of possible implications of this research direction for future autobiographic, narrative agents.
Framewise phoneme classification with bidirectional LSTM and other neural network architectures.
Graves, Alex; Schmidhuber, Jürgen
2005-01-01
In this paper, we present bidirectional Long Short Term Memory (LSTM) networks, and a modified, full gradient version of the LSTM learning algorithm. We evaluate Bidirectional LSTM (BLSTM) and several other network architectures on the benchmark task of framewise phoneme classification, using the TIMIT database. Our main findings are that bidirectional networks outperform unidirectional ones, and Long Short Term Memory (LSTM) is much faster and also more accurate than both standard Recurrent Neural Nets (RNNs) and time-windowed Multilayer Perceptrons (MLPs). Our results support the view that contextual information is crucial to speech processing, and suggest that BLSTM is an effective architecture with which to exploit it.
Low-Power Architectures for Large Radio Astronomy Correlators
NASA Technical Reports Server (NTRS)
D'Addario, Larry R.
2011-01-01
The architecture of a cross-correlator for a synthesis radio telescope with N greater than 1000 antennas is studied with the objective of minimizing power consumption. It is found that the optimum architecture minimizes memory operations, and this implies preference for a matrix structure over a pipeline structure and avoiding the use of memory banks as accumulation registers when sharing multiply-accumulators among baselines. A straw-man design for N = 2000 and bandwidth of 1 GHz, based on ASICs fabricated in a 90 nm CMOS process, is presented. The cross-correlator proper (excluding per-antenna processing) is estimated to consume less than 35 kW.
On the Performance of an Algebraic MultigridSolver on Multicore Clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, A H; Schulz, M; Yang, U M
2010-04-29
Algebraic multigrid (AMG) solvers have proven to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore cluster architectures, we face new challenges that can significantly harm AMG's performance. We discuss our experiences on such an architecture and present a set of techniques that help users to overcome the associated problems, including thread and process pinning and correct memory associations. We have implemented most of the techniques in a MultiCore SUPport library (MCSup), which helps to map OpenMP applications to multicore machines. We present results using both an MPI-only and a hybrid MPI/OpenMP model.
Unsworth, Nash; Spillers, Gregory J; Brewer, Gene A
2012-01-01
In two experiments, the locus of individual differences in working memory capacity and long-term memory recall was examined. Participants performed categorical cued and free recall tasks, and individual differences in the dynamics of recall were interpreted in terms of a hierarchical-search framework. The results from this study are in accordance with recent theorizing suggesting a strong relation between working memory capacity and retrieval from long-term memory. Furthermore, the results also indicate that individual differences in categorical recall are partially due to differences in accessibility. In terms of accessibility of target information, two important factors drive the difference between high- and low-working-memory-capacity participants. Low-working-memory-capacity participants fail to utilize appropriate retrieval strategies to access cues, and they also have difficulty resolving cue overload. Thus, when low-working-memory-capacity participants were given specific cues that activated a smaller set of potential targets, their recall performance was the same as that of high-working-memory-capacity participants.
Recurrent Network models of sequence generation and memory
Rajan, Kanaka; Harvey, Christopher D; Tank, David W
2016-01-01
SUMMARY Sequential activation of neurons is a common feature of network activity during a variety of behaviors, including working memory and decision making. Previous network models for sequences and memory emphasized specialized architectures in which a principled mechanism is pre-wired into their connectivity. Here, we demonstrate that starting from random connectivity and modifying a small fraction of connections, a largely disordered recurrent network can produce sequences and implement working memory efficiently. We use this process, called Partial In-Network training (PINning), to model and match cellular-resolution imaging data from the posterior parietal cortex during a virtual memory-guided two-alternative forced choice task [Harvey, Coen and Tank, 2012]. Analysis of the connectivity reveals that sequences propagate by the cooperation between recurrent synaptic interactions and external inputs, rather than through feedforward or asymmetric connections. Together our results suggest that neural sequences may emerge through learning from largely unstructured network architectures. PMID:26971945
Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schreiber, Robert
Summary of technical results of Blackcomb Memory Devices We explored various different memory technologies (STTRAM, PCRAM, FeRAM, and ReRAM). The progress can be classified into three categories, below. Modeling and Tool Releases Various modeling tools have been developed over the last decade to help in the design of SRAM or DRAM-based memory hierarchies. To explore new design opportunities that NVM technologies can bring to the designers, we have developed similar high-level models for NVM, including PCRAMsim [Dong 2009], NVSim [Dong 2012], and NVMain [Poremba 2012]. NVSim is a circuit-level model for NVM performance, energy, and area estimation, which supports variousmore » NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash. NVSim is successfully validated against industrial NVM prototypes, and it is expected to help boost architecture-level NVM-related studies. On the other side, NVMain is a cycle accurate main memory simulator designed to simulate emerging nonvolatile memories at the architectural level. We have released these models as open source tools and provided contiguous support to them. We also proposed PS3-RAM, which is a fast, portable and scalable statistical STT-RAM reliability analysis model [Wen 2012]. Design Space Exploration and Optimization With the support of these models, we explore different device/circuit optimization techniques. For example, in [Niu 2012a] we studied the power reduction technique for the application of ECC scheme in ReRAM designs and proposed to use ECC code to relax the BER (Bit Error Rate) requirement of a single memory to improve the write energy consumption and latency for both 1T1R and cross-point ReRAM designs. In [Xu 2011], we proposed a methodology to design STT-RAM for different optimization goals such as read performance, write performance and write energy by leveraging the trade-off between write current and write time of MTJ. We also studied the tradeoffs in building a reliable crosspoint ReRAM array [Niu 2012b]. We have conducted an in depth analysis of the circuit and system level design implications of multi-layer cross-point Resistive RAM (MLCReRAM) from performance, power and reliability perspectives [Xu 2013]. The objective of this study is to understand the design trade-offs of this technology with respect to the MLC Phase Change Memory (MLCPCM).Our MLC ReRAM design at the circuit and system levels indicates that different resistance allocation schemes, programming strategies, peripheral designs, and material selections profoundly affect the area, latency, power, and reliability of MLC ReRAM. Based on this analysis, we conduct two case studies: first we compare MLC ReRAM design against MLC phase-change memory (PCM) and multi-layer cross-point ReRAM design, and point out why multi-level ReRAM is appealing; second we further explore the design space for MLC ReRAM. Architecture and Application We explored hybrid checkpointing using phase-change memory for future exascale systems [Dong 2011] and showed that the use of nonvolatile memory for local checkpointing significantly increases the number of faults covered by local checkpoints and reduces the probability of a global failure in the middle of a global checkpoint to less than 1%. We also proposed a technique called i2WAP to mitigate the write variations in NVM-based last-level cache for the improvement of the NVM lifetime [Wang 2013]. Our wear leveling technique attempts to work around the limitations of write endurance by arranging data access so that write operations can be distributed evenly across all the storage cells. During our intensive research on fault-tolerant NVM design, we found that ECC cannot effectively tolerate hard errors from limited write endurance and process imperfection. Therefore, we devised a novel Point and Discard (PAD) architecture in in [ 2012] as a hard-error-tolerant architecture for ReRAM-based Last Level Caches. PAD improves the lifetime of ReRAM caches by 1.6X-440X under different process variations without performance overhead in the system's early life. We have investigated the applicability of NVM for persistent memory design [Zhao 2013]. New byte addressable NVM enables fast persistent memory that allows in-memory persistent data objects to be updated with much higher throughput. Despite the significant improvement, the performance of these designs is only 50% of the native system with no persistence support, due to the logging or copy-on-write mechanisms used to update the persistent memory. A challenge in this approach is therefore how to efficiently enable atomic, consistent, and durable updates to ensure data persistence that survives application and/or system failures. We have designed a persistent memory system, called Klin, that can provide performance as close as that of the native system. The Klin design adopts a non-volatile cache and a non-volatile main memory for constructing a multi-versioned durable memory system, enabling atomic updates without logging or copy-on-write. Our evaluation shows that the proposed Kiln mechanism can achieve up to 2X of performance improvement to NVRAM-based persistent memory employing write-ahead logging. In addition, our design has numerous practical advantages: a simple and intuitive abstract interface, microarchitecture-level optimizations, fast recovery from failures, and no redundant writes to slow non-volatile storage media. The work was published in MICRO 2013 and received Best Paper Honorable Mentioned Award.« less
Memory availability and referential access
Johns, Clinton L.; Gordon, Peter C.; Long, Debra L.; Swaab, Tamara Y.
2013-01-01
Most theories of coreference specify linguistic factors that modulate antecedent accessibility in memory; however, whether non-linguistic factors also affect coreferential access is unknown. Here we examined the impact of a non-linguistic generation task (letter transposition) on the repeated-name penalty, a processing difficulty observed when coreferential repeated names refer to syntactically prominent (and thus more accessible) antecedents. In Experiment 1, generation improved online (event-related potentials) and offline (recognition memory) accessibility of names in word lists. In Experiment 2, we manipulated generation and syntactic prominence of antecedent names in sentences; both improved online and offline accessibility, but only syntactic prominence elicited a repeated-name penalty. Our results have three important implications: first, the form of a referential expression interacts with an antecedent’s status in the discourse model during coreference; second, availability in memory and referential accessibility are separable; and finally, theories of coreference must better integrate known properties of the human memory system. PMID:24443621
Multiprocessor and memory architecture of the neurocomputer SYNAPSE-1.
Ramacher, U; Raab, W; Anlauf, J; Hachmann, U; Beichter, J; Brüls, N; Wesseling, M; Sicheneder, E; Männer, R; Glass, J
1993-12-01
A general purpose neurocomputer, SYNAPSE-1, which exhibits a multiprocessor and memory architecture is presented. It offers wide flexibility with respect to neural algorithms and a speed-up factor of several orders of magnitude--including learning. The computational power is provided by a 2-dimensional systolic array of neural signal processors. Since the weights are stored outside these NSPs, memory size and processing power can be adapted individually to the application needs. A neural algorithms programming language, embedded in C(+2) has been defined for the user to cope with the neurocomputer. In a benchmark test, the prototype of SYNAPSE-1 was 8000 times as fast as a standard workstation.
Remote Memory and Cortical Synaptic Plasticity Require Neuronal CCCTC-Binding Factor (CTCF).
Kim, Somi; Yu, Nam-Kyung; Shim, Kyu-Won; Kim, Ji-Il; Kim, Hyopil; Han, Dae Hee; Choi, Ja Eun; Lee, Seung-Woo; Choi, Dong Il; Kim, Myung Won; Lee, Dong-Sung; Lee, Kyungmin; Galjart, Niels; Lee, Yong-Seok; Lee, Jae-Hyung; Kaang, Bong-Kiun
2018-05-30
The molecular mechanism of long-term memory has been extensively studied in the context of the hippocampus-dependent recent memory examined within several days. However, months-old remote memory maintained in the cortex for long-term has not been investigated much at the molecular level yet. Various epigenetic mechanisms are known to be important for long-term memory, but how the 3D chromatin architecture and its regulator molecules contribute to neuronal plasticity and systems consolidation is still largely unknown. CCCTC-binding factor (CTCF) is an 11-zinc finger protein well known for its role as a genome architecture molecule. Male conditional knock-out mice in which CTCF is lost in excitatory neurons during adulthood showed normal recent memory in the contextual fear conditioning and spatial water maze tasks. However, they showed remarkable impairments in remote memory in both tasks. Underlying the remote memory-specific phenotypes, we observed that female CTCF conditional knock-out mice exhibit disrupted cortical LTP, but not hippocampal LTP. Similarly, we observed that CTCF deletion in inhibitory neurons caused partial impairment of remote memory. Through RNA sequencing, we observed that CTCF knockdown in cortical neuron culture caused altered expression of genes that are highly involved in cell adhesion, synaptic plasticity, and memory. These results suggest that remote memory storage in the cortex requires CTCF-mediated gene regulation in neurons, whereas recent memory formation in the hippocampus does not. SIGNIFICANCE STATEMENT CCCTC-binding factor (CTCF) is a well-known 3D genome architectural protein that regulates gene expression. Here, we use two different CTCF conditional knock-out mouse lines and reveal, for the first time, that CTCF is critically involved in the regulation of remote memory. We also show that CTCF is necessary for appropriate expression of genes, many of which we found to be involved in the learning- and memory-related processes. Our study provides behavioral and physiological evidence for the involvement of CTCF-mediated gene regulation in the remote long-term memory and elucidates our understanding of systems consolidation mechanisms. Copyright © 2018 the authors 0270-6474/18/385042-11$15.00/0.
Save Now [Y/N]? Machine Memory at War in Iain Banks' "Look to Windward"
ERIC Educational Resources Information Center
Blackmore, Tim
2010-01-01
Creating memory during and after wartime trauma is vexed by state attempts to control public and private discourse. Science fiction author Iain Banks' novel "Look to Windward" proposes different ways of preserving memory and culture, from posthuman memory devices, to artwork, to architecture, to personal, local ways of remembering.…
Reisinger, Florian; Krishna, Ritesh; Ghali, Fawaz; Ríos, Daniel; Hermjakob, Henning; Vizcaíno, Juan Antonio; Jones, Andrew R
2012-03-01
We present a Java application programming interface (API), jmzIdentML, for the Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) mzIdentML standard for peptide and protein identification data. The API combines the power of Java Architecture of XML Binding (JAXB) and an XPath-based random-access indexer to allow a fast and efficient mapping of extensible markup language (XML) elements to Java objects. The internal references in the mzIdentML files are resolved in an on-demand manner, where the whole file is accessed as a random-access swap file, and only the relevant piece of XMLis selected for mapping to its corresponding Java object. The APIis highly efficient in its memory usage and can handle files of arbitrary sizes. The APIfollows the official release of the mzIdentML (version 1.1) specifications and is available in the public domain under a permissive licence at http://www.code.google.com/p/jmzidentml/. © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Fechner, Hanna B; Pachur, Thorsten; Schooler, Lael J; Mehlhorn, Katja; Battal, Ceren; Volz, Kirsten G; Borst, Jelmer P
2016-12-01
How do people use memories to make inferences about real-world objects? We tested three strategies based on predicted patterns of response times and blood-oxygen-level-dependent (BOLD) responses: one strategy that relies solely on recognition memory, a second that retrieves additional knowledge, and a third, lexicographic (i.e., sequential) strategy, that considers knowledge conditionally on the evidence obtained from recognition memory. We implemented the strategies as computational models within the Adaptive Control of Thought-Rational (ACT-R) cognitive architecture, which allowed us to derive behavioral and neural predictions that we then compared to the results of a functional magnetic resonance imaging (fMRI) study in which participants inferred which of two cities is larger. Overall, versions of the lexicographic strategy, according to which knowledge about many but not all alternatives is searched, provided the best account of the joint patterns of response times and BOLD responses. These results provide insights into the interplay between recognition and additional knowledge in memory, hinting at an adaptive use of these two sources of information in decision making. The results highlight the usefulness of implementing models of decision making within a cognitive architecture to derive predictions on the behavioral and neural level. Copyright © 2016 Elsevier B.V. All rights reserved.
A simple modern correctness condition for a space-based high-performance multiprocessor
NASA Technical Reports Server (NTRS)
Probst, David K.; Li, Hon F.
1992-01-01
A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture, and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.
Avionics Architecture Standards as an Approach to Obsolescence Management
2000-10-01
and goals is one method of system. The term System Architecture refers to a achieving the necessary critical mass of skilled and consistent set of such...Processing Module (GPM), Mass Memory Module executed on the modules within an ASAAC system will (MMM) and Power Conversion Module (PCM). be stored in a central...location, the Mass Memory * MOS -Module Support Layer to Operating System Module (MMM). Therefore, if modules are to be The purpose of the MOS
The dynamic interplay between acute psychosocial stress, emotion and autobiographical memory.
Sheldon, Signy; Chu, Sonja; Nitschke, Jonas P; Pruessner, Jens C; Bartz, Jennifer A
2018-06-06
Although acute psychosocial stress can impact autobiographical memory retrieval, the nature of this effect is not entirely clear. One reason for this ambiguity is because stress can have opposing effects on the different stages of autobiographical memory retrieval. We addressed this issue by testing how acute stress affects three stages of the autobiographical memory retrieval - accessing, recollecting and reconsolidating a memory. We also investigate the influence of emotion valence on this effect. In a between-subjects design, participants were first exposed to an acute psychosocial stressor or a control task. Next, the participants were shown positive, negative or neutral retrieval cues and asked to access and describe autobiographical memories. After a three to four day delay, participants returned for a second session in which they described these autobiographical memories. During initial retrieval, stressed participants were slower to access memories than were control participants; moreover, cortisol levels were positively associated with response times to access positively-cued memories. There were no effects of stress on the amount of details used to describe memories during initial retrieval, but stress did influence memory detail during session two. During session two, stressed participants recovered significantly more details, particularly emotional ones, from the remembered events than control participants. Our results indicate that the presence of stress impairs the ability to access consolidated autobiographical memories; moreover, although stress has no effect on memory recollection, stress alters how recollected experiences are reconsolidated back into memory traces.
Federal Register 2010, 2011, 2012, 2013, 2014
2010-09-08
... ARCHITECTURAL AND TRANSPORTATION BARRIERS COMPLIANCE BOARD 36 CFR Part 1192 [Docket No. ATBCB 2010... Vehicles AGENCY: Architectural and Transportation Barriers Compliance Board. ACTION: Notice of public hearings. SUMMARY: The Architectural and Transportation Barriers Compliance Board (Access Board) will hold...
NASA Technical Reports Server (NTRS)
Irom, Farokh; Nguyen, Duc N.
2010-01-01
High-density, commercial, nonvolatile flash memories with NAND architecture are now available from several manufacturers. This report examines SEE effects and TID response in single-level cell (SLC) and multi-level cell (MLC) NAND flash memories manufactured by Micron Technology.
ERIC Educational Resources Information Center
Oberauer, Klaus; Bialkova, Svetlana
2009-01-01
Processing information in working memory requires selective access to a subset of working-memory contents by a focus of attention. Complex cognition often requires joint access to 2 items in working memory. How does the focus select 2 items? Two experiments with an arithmetic task and 1 with a spatial task investigate time demands for successive…
Federal Register 2010, 2011, 2012, 2013, 2014
2013-06-13
... INTERNATIONAL TRADE COMMISSION [Investigation No. 337-TA-792] Certain Static Random Access Memories and Products Containing Same; Commission Determination Affirming a Final Initial Determination..., and the sale within the United States after importation of certain static random access memories and...
... free mailed brochure Table of Contents Introduction The Architecture of the Neuron Birth Migration Differentiation Death Hope ... generated neurons in learning and memory. Neuron The Architecture of the Neuron The central nervous system (which ...
Exascale Hardware Architectures Working Group
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hemmert, S; Ang, J; Chiang, P
2011-03-15
The ASC Exascale Hardware Architecture working group is challenged to provide input on the following areas impacting the future use and usability of potential exascale computer systems: processor, memory, and interconnect architectures, as well as the power and resilience of these systems. Going forward, there are many challenging issues that will need to be addressed. First, power constraints in processor technologies will lead to steady increases in parallelism within a socket. Additionally, all cores may not be fully independent nor fully general purpose. Second, there is a clear trend toward less balanced machines, in terms of compute capability compared tomore » memory and interconnect performance. In order to mitigate the memory issues, memory technologies will introduce 3D stacking, eventually moving on-socket and likely on-die, providing greatly increased bandwidth but unfortunately also likely providing smaller memory capacity per core. Off-socket memory, possibly in the form of non-volatile memory, will create a complex memory hierarchy. Third, communication energy will dominate the energy required to compute, such that interconnect power and bandwidth will have a significant impact. All of the above changes are driven by the need for greatly increased energy efficiency, as current technology will prove unsuitable for exascale, due to unsustainable power requirements of such a system. These changes will have the most significant impact on programming models and algorithms, but they will be felt across all layers of the machine. There is clear need to engage all ASC working groups in planning for how to deal with technological changes of this magnitude. The primary function of the Hardware Architecture Working Group is to facilitate codesign with hardware vendors to ensure future exascale platforms are capable of efficiently supporting the ASC applications, which in turn need to meet the mission needs of the NNSA Stockpile Stewardship Program. This issue is relatively immediate, as there is only a small window of opportunity to influence hardware design for 2018 machines. Given the short timeline a firm co-design methodology with vendors is of prime importance.« less
A Parallel Rendering Algorithm for MIMD Architectures
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.; Orloff, Tobias
1991-01-01
Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
SRAM Based Re-programmable FPGA for Space Applications
NASA Technical Reports Server (NTRS)
Wang, J. J.; Sun, J. S.; Cronquist, B. E.; McCollum, J. L.; Speers, T. M.; Plants, W. C.; Katz, R. B.
1999-01-01
An SRAM (static random access memory)-based reprogrammable FPGA (field programmable gate array) is investigated for space applications. A new commercial prototype, named the RS family, was used as an example for the investigation. The device is fabricated in a 0.25 micrometers CMOS technology. Its architecture is reviewed to provide a better understanding of the impact of single event upset (SEU) on the device during operation. The SEU effect of different memories available on the device is evaluated. Heavy ion test data and SPICE simulations are used integrally to extract the threshold LET (linear energy transfer). Together with the saturation cross-section measurement from the layout, a rate prediction is done on each memory type. The SEU in the configuration SRAM is identified as the dominant failure mode and is discussed in detail. The single event transient error in combinational logic is also investigated and simulated by SPICE. SEU mitigation by hardening the memories and employing EDAC (error detection and correction) at the device level are presented. For the configuration SRAM (CSRAM) cell, the trade-off between resistor de-coupling and redundancy hardening techniques are investigated with interesting results. Preliminary heavy ion test data show no sign of SEL (single event latch-up). With regard to ionizing radiation effects, the increase in static leakage current (static I(sub CC)) measured indicates a device tolerance of approximately 50krad(Si).
Parallel language constructs for tensor product computations on loosely coupled architectures
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Vanrosendale, John
1989-01-01
Distributed memory architectures offer high levels of performance and flexibility, but have proven awkard to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming, and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernal routines is focused on first, such as parallel tridiagonal solvers, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, Tyler Barratt; Urrea, Jorge Mario
2012-06-01
The aim of the Authenticating Cache architecture is to ensure that machine instructions in a Read Only Memory (ROM) are legitimate from the time the ROM image is signed (immediately after compilation) to the time they are placed in the cache for the processor to consume. The proposed architecture allows the detection of ROM image modifications during distribution or when it is loaded into memory. It also ensures that modified instructions will not execute in the processor-as the cache will not be loaded with a page that fails an integrity check. The authenticity of the instruction stream can also bemore » verified in this architecture. The combination of integrity and authenticity assurance greatly improves the security profile of a system.« less
NASA Technical Reports Server (NTRS)
Irom, Farokh; Nguyen, Duc N.
2011-01-01
High-density, commercial, nonvolatile flash memories with NAND architecture are now available from several manufacturers. This report examines SEE effects and TID response in single-level cell (SLC) 32Gb and multi-level cell (MLC) 64Gb NAND flash memories manufactured by Micron Technology.
Expert Systems on Multiprocessor Architectures. Volume 2. Technical Reports
1991-06-01
Report RC 12936 (#58037). IBM T. J. Wartson Reiearch Center. July 1987. Alan Jay Smith. Cache memories. Coniputing Sitrry., 1.1(3): I.3-5:30...basic-shared is an instrument for ashared memory design. The components panels are processor- qload-scrolling-bar-panel, memory-qload-scrolling-bar-panel
Federal Register 2010, 2011, 2012, 2013, 2014
2013-05-02
... INTERNATIONAL TRADE COMMISSION [Investigation No. 337-TA-792] Certain Static Random Access Memories and Products Containing Same; Commission Determination To Review in Part a Final Initial... States after importation of certain static random access memories and products containing the same by...
NASA Technical Reports Server (NTRS)
Israel, David J.
2005-01-01
The NASA Space Network (SN) supports a variety of missions using the Tracking and Data Relay Satellite System (TDRSS), which includes ground stations in White Sands, New Mexico and Guam. A Space Network IP Services (SNIS) architecture is being developed to support future users with requirements for end-to-end Internet Protocol (IP) communications. This architecture will support all IP protocols, including Mobile IP, over TDRSS Single Access, Multiple Access, and Demand Access Radio Frequency (RF) links. This paper will describe this architecture and how it can enable Low Earth Orbiting IP satellite missions.
Programming time-multiplexed reconfigurable hardware using a scalable neuromorphic compiler.
Minkovich, Kirill; Srinivasa, Narayan; Cruz-Albrecht, Jose M; Cho, Youngkwan; Nogin, Aleksey
2012-06-01
Scalability and connectivity are two key challenges in designing neuromorphic hardware that can match biological levels. In this paper, we describe a neuromorphic system architecture design that addresses an approach to meet these challenges using traditional complementary metal-oxide-semiconductor (CMOS) hardware. A key requirement in realizing such neural architectures in hardware is the ability to automatically configure the hardware to emulate any neural architecture or model. The focus for this paper is to describe the details of such a programmable front-end. This programmable front-end is composed of a neuromorphic compiler and a digital memory, and is designed based on the concept of synaptic time-multiplexing (STM). The neuromorphic compiler automatically translates any given neural architecture to hardware switch states and these states are stored in digital memory to enable desired neural architectures. STM enables our proposed architecture to address scalability and connectivity using traditional CMOS hardware. We describe the details of the proposed design and the programmable front-end, and provide examples to illustrate its capabilities. We also provide perspectives for future extensions and potential applications.
Low latency and persistent data storage
Fitch, Blake G; Franceschini, Michele M; Jagmohan, Ashish; Takken, Todd E
2014-02-18
Persistent data storage is provided by a method that includes receiving a low latency store command that includes write data. The write data is written to a first memory device that is implemented by a nonvolatile solid-state memory technology characterized by a first access speed. It is acknowledged that the write data has been successfully written to the first memory device. The write data is written to a second memory device that is implemented by a volatile memory technology. At least a portion of the data in the first memory device is written to a third memory device when a predetermined amount of data has been accumulated in the first memory device. The third memory device is implemented by a nonvolatile solid-state memory technology characterized by a second access speed that is slower than the first access speed.
1989-05-12
USA Resonant tunneling transistors and New III-V memory devices for new circuit architectures with reduced complexity F. Capasso, Bell. Murray Hill...the evaporation, or by selective oxidation of As, leaving metallic Ga clusters and b) the interdiffusive deterioration of metal contacts on GaAs...VEB (My) Resonant Tunneling Transistors and New III-V Memory Devices for New Circuit Architectures with Reduced Complexity . Invited: F. Capasso
System and method for programmable bank selection for banked memory subsystems
Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Hoenicke, Dirk; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan
2010-09-07
A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each of the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures.
NASA Astrophysics Data System (ADS)
Natsui, Masanori; Hanyu, Takahiro
2018-04-01
In realizing a nonvolatile microcontroller unit (MCU) for sensor nodes in Internet-of-Things (IoT) applications, it is important to solve the data-transfer bottleneck between the central processing unit (CPU) and the nonvolatile memory constituting the MCU. As one circuit-oriented approach to solving this problem, we propose a memory access minimization technique for magnetoresistive-random-access-memory (MRAM)-embedded nonvolatile MCUs. In addition to multiplexing and prefetching of memory access, the proposed technique realizes efficient instruction fetch by eliminating redundant memory access while considering the code length of the instruction to be fetched and the transition of the memory address to be accessed. As a result, the performance of the MCU can be improved while relaxing the performance requirement for the embedded MRAM, and compact and low-power implementation can be performed as compared with the conventional cache-based one. Through the evaluation using a system consisting of a general purpose 32-bit CPU and embedded MRAM, it is demonstrated that the proposed technique increases the peak efficiency of the system up to 3.71 times, while a 2.29-fold area reduction is achieved compared with the cache-based one.
Next Generation Mass Memory Architecture
NASA Astrophysics Data System (ADS)
Herpel, H.-J.; Stahle, M.; Lonsdorfer, U.; Binzer, N.
2010-08-01
Future Mass Memory units will have to cope with various demanding requirements driven by onboard instruments (optical and SAR) that generate a huge amount of data (>10TBit) at a data rate > 6 Gbps. For downlink data rates around 3 Gbps will be feasible using latest ka-band technology together with Variable Coding and Modulation (VCM) techniques. These high data rates and storage capacities need to be effectively managed. Therefore, data structures and data management functions have to be improved and adapted to existing standards like the Packet Utilisation Standard (PUS). In this paper we will present a highly modular and scalable architectural approach for mass memories in order to support a wide range of mission requirements.
Content addressable memory project
NASA Technical Reports Server (NTRS)
Hall, Josh; Levy, Saul; Smith, D.; Wei, S.; Miyake, K.; Murdocca, M.
1991-01-01
The progress on the Rutgers CAM (Content Addressable Memory) Project is described. The overall design of the system is completed at the architectural level and described. The machine is composed of two kinds of cells: (1) the CAM cells which include both memory and processor, and support local processing within each cell; and (2) the tree cells, which have smaller instruction set, and provide global processing over the CAM cells. A parameterized design of the basic CAM cell is completed. Progress was made on the final specification of the CPS. The machine architecture was driven by the design of algorithms whose requirements are reflected in the resulted instruction set(s). A few of these algorithms are described.
Zhao, Jun; Chen, Min; Wang, Xiaoyan; Zhao, Xiaodong; Wang, Zhenwen; Dang, Zhi-Min; Ma, Lan; Hu, Guo-Hua; Chen, Fenghua
2013-06-26
In this paper, the triple shape memory effects (SMEs) observed in chemically cross-linked polyethylene (PE)/polypropylene (PP) blends with cocontinuous architecture are systematically investigated. The cocontinuous window of typical immiscible PE/PP blends is the volume fraction of PE (v(PE)) of ca. 30-70 vol %. This architecture can be stabilized by chemical cross-linking. Different initiators, 2,5-dimethyl-2,5-di(tert-butylperoxy)-hexane (DHBP), dicumylperoxide (DCP) coupled with divinylbenzene (DVB) (DCP-DVB), and their mixture (DHBP/DCP-DVB), are used for the cross-linking. According to the differential scanning calorimetry (DSC) measurements and gel fraction calculations, DHBP produces the best cross-linking and DCP-DVB the worst, and the mixture, DHBP/DCP-DVB, is in between. The chemical cross-linking causes lower melting temperature (Tm) and smaller melting enthalpy (ΔHm). The prepared triple shape memory polymers (SMPs) by cocontinuous immiscible PE/PP blends with v(PE) of 50 vol % show pronounced triple SMEs in the dynamic mechanical thermal analysis (DMTA) and visual observation. This new strategy of chemically cross-linked immiscible blends with cocontinuous architecture can be used to design and prepare new SMPs with triple SMEs.
Multiprocessor architectural study
NASA Technical Reports Server (NTRS)
Kosmala, A. L.; Stanten, S. F.; Vandever, W. H.
1972-01-01
An architectural design study was made of a multiprocessor computing system intended to meet functional and performance specifications appropriate to a manned space station application. Intermetrics, previous experience, and accumulated knowledge of the multiprocessor field is used to generate a baseline philosophy for the design of a future SUMC* multiprocessor. Interrupts are defined and the crucial questions of interrupt structure, such as processor selection and response time, are discussed. Memory hierarchy and performance is discussed extensively with particular attention to the design approach which utilizes a cache memory associated with each processor. The ability of an individual processor to approach its theoretical maximum performance is then analyzed in terms of a hit ratio. Memory management is envisioned as a virtual memory system implemented either through segmentation or paging. Addressing is discussed in terms of various register design adopted by current computers and those of advanced design.
Long-term knowledge acquisition using contextual information in a memory-inspired robot architecture
NASA Astrophysics Data System (ADS)
Pratama, Ferdian; Mastrogiovanni, Fulvio; Lee, Soon Geul; Chong, Nak Young
2017-03-01
In this paper, we present a novel cognitive framework allowing a robot to form memories of relevant traits of its perceptions and to recall them when necessary. The framework is based on two main principles: on the one hand, we propose an architecture inspired by current knowledge in human memory organisation; on the other hand, we integrate such an architecture with the notion of context, which is used to modulate the knowledge acquisition process when consolidating memories and forming new ones, as well as with the notion of familiarity, which is employed to retrieve proper memories given relevant cues. Although much research has been carried out, which exploits Machine Learning approaches to provide robots with internal models of their environment (including objects and occurring events therein), we argue that such approaches may not be the right direction to follow if a long-term, continuous knowledge acquisition is to be achieved. As a case study scenario, we focus on both robot-environment and human-robot interaction processes. In case of robot-environment interaction, a robot performs pick and place movements using the objects in the workspace, at the same time observing their displacement on a table in front of it, and progressively forms memories defined as relevant cues (e.g. colour, shape or relative position) in a context-aware fashion. As far as human-robot interaction is concerned, the robot can recall specific snapshots representing past events using both sensory information and contextual cues upon request by humans.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Learn, Mark Walter
Sandia National Laboratories is currently developing new processing and data communication architectures for use in future satellite payloads. These architectures will leverage the flexibility and performance of state-of-the-art static-random-access-memory-based Field Programmable Gate Arrays (FPGAs). One such FPGA is the radiation-hardened version of the Virtex-5 being developed by Xilinx. However, not all features of this FPGA are being radiation-hardened by design and could still be susceptible to on-orbit upsets. One such feature is the embedded hard-core PPC440 processor. Since this processor is implemented in the FPGA as a hard-core, traditional mitigation approaches such as Triple Modular Redundancy (TMR) are not availablemore » to improve the processor's on-orbit reliability. The goal of this work is to investigate techniques that can help mitigate the embedded hard-core PPC440 processor within the Virtex-5 FPGA other than TMR. Implementing various mitigation schemes reliably within the PPC440 offers a powerful reconfigurable computing resource to these node-based processing architectures. This document summarizes the work done on the cache mitigation scheme for the embedded hard-core PPC440 processor within the Virtex-5 FPGAs, and describes in detail the design of the cache mitigation scheme and the testing conducted at the radiation effects facility on the Texas A&M campus.« less
Mozaffari, Brian
2014-01-01
Based on the notion that the brain is equipped with a hierarchical organization, which embodies environmental contingencies across many time scales, this paper suggests that the medial temporal lobe (MTL)—located deep in the hierarchy—serves as a bridge connecting supra- to infra—MTL levels. Bridging the upper and lower regions of the hierarchy provides a parallel architecture that optimizes information flow between upper and lower regions to aid attention, encoding, and processing of quick complex visual phenomenon. Bypassing intermediate hierarchy levels, information conveyed through the MTL “bridge” allows upper levels to make educated predictions about the prevailing context and accordingly select lower representations to increase the efficiency of predictive coding throughout the hierarchy. This selection or activation/deactivation is associated with endogenous attention. In the event that these “bridge” predictions are inaccurate, this architecture enables the rapid encoding of novel contingencies. A review of hierarchical models in relation to memory is provided along with a new theory, Medial-temporal-lobe Conduit for Parallel Connectivity (MCPC). In this scheme, consolidation is considered as a secondary process, occurring after a MTL-bridged connection, which eventually allows upper and lower levels to access each other directly. With repeated reactivations, as contingencies become consolidated, less MTL activity is predicted. Finally, MTL bridging may aid processing transient but structured perceptual events, by allowing communication between upper and lower levels without calling on intermediate levels of representation. PMID:25426036
DOE Office of Scientific and Technical Information (OSTI.GOV)
Venkata, Manjunath Gorentla; Aderholdt, William F
The pre-exascale systems are expected to have a significant amount of hierarchical and heterogeneous on-node memory, and this trend of system architecture in extreme-scale systems is expected to continue into the exascale era. along with hierarchical-heterogeneous memory, the system typically has a high-performing network ad a compute accelerator. This system architecture is not only effective for running traditional High Performance Computing (HPC) applications (Big-Compute), but also for running data-intensive HPC applications and Big-Data applications. As a consequence, there is a growing desire to have a single system serve the needs of both Big-Compute and Big-Data applications. Though the system architecturemore » supports the convergence of the Big-Compute and Big-Data, the programming models and software layer have yet to evolve to support either hierarchical-heterogeneous memory systems or the convergence. A programming abstraction to address this problem. The programming abstraction is implemented as a software library and runs on pre-exascale and exascale systems supporting current and emerging system architecture. Using distributed data-structures as a central concept, it provides (1) a simple, usable, and portable abstraction for hierarchical-heterogeneous memory and (2) a unified programming abstraction for Big-Compute and Big-Data applications.« less
Space Radiation Effects in Advanced Flash Memories
NASA Technical Reports Server (NTRS)
Johnston, A. H.
2001-01-01
Memory storage requirements in space systems have steadily increased, much like storage requirements in terrestrial systems. Large arrays of dynamic memories (DRAMs) have been used in solid-state recorders, relying on a combination of shielding and error-detection-and correction (EDAC) to overcome the extreme sensitivity of DRAMs to space radiation. For example, a 2-Gbit memory (with 4-Mb DRAMs) used on the Clementine mission functioned perfectly during its moon mapping mission, in spite of an average of 71 memory bit flips per day from heavy ions. Although EDAC worked well with older types of memory circuits, newer DRAMs use extremely complex internal architectures which has made it increasingly difficult to implement EDAC. Some newer DRAMs have also exhibited catastrophic latchup. Flash memories are an intriguing alternative to DRAMs because of their nonvolatile storage and extremely high storage density, particularly for applications where writing is done relatively infrequently. This paper discusses radiation effects in advanced flash memories, including general observations on scaling and architecture as well as the specific experience obtained at the Jet Propulsion Laboratory in evaluating high-density flash memories for use on the NASA mission to Europa, one of Jupiter's moons. This particular mission must pass through the Jovian radiation belts, which imposes a very demanding radiation requirement.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-01-13
... DEPARTMENT OF COMMERCE International Trade Administration [C-580-851] Dynamic Random Access Memory... administrative review of the countervailing duty order on dynamic random access memory semiconductors from the... following events have occurred since the publication of the preliminary results of this review. See Dynamic...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-04-20
... DEPARTMENT OF COMMERCE International Trade Administration [C-580-851] Dynamic Random Access Memory Semiconductors from the Republic of Korea: Extension of Time Limit for Preliminary Results of Countervailing Duty... access memory semiconductors from the Republic of Korea, covering the period January 1, 2008 through...
Accessibility versus Accuracy in Retrieving Spatial Memory: Evidence for Suboptimal Assumed Headings
ERIC Educational Resources Information Center
Yerramsetti, Ashok; Marchette, Steven A.; Shelton, Amy L.
2013-01-01
Orientation dependence in spatial memory has often been interpreted in terms of accessibility: Object locations are encoded relative to a reference orientation that affords the most accurate access to spatial memory. An open question, however, is whether people naturally use this "preferred" orientation whenever recalling the space. We…
78 FR 59316 - Rail Vehicles Access Advisory Committee Meetings
Federal Register 2010, 2011, 2012, 2013, 2014
2013-09-26
...-0001] RIN 3014-AA42 Rail Vehicles Access Advisory Committee Meetings AGENCY: Architectural and... Vehicles Access Advisory Committee (Committee) will hold its first meeting. We, the Architectural and... transportation vehicles that operate on fixed guideway systems (e.g., rapid rail, light rail, commuter rail...
ASIC-based architecture for the real-time computation of 2D convolution with large kernel size
NASA Astrophysics Data System (ADS)
Shao, Rui; Zhong, Sheng; Yan, Luxin
2015-12-01
Bidimensional convolution is a low-level processing algorithm of interest in many areas, but its high computational cost constrains the size of the kernels, especially in real-time embedded systems. This paper presents a hardware architecture for the ASIC-based implementation of 2-D convolution with medium-large kernels. Aiming to improve the efficiency of storage resources on-chip, reducing off-chip bandwidth of these two issues, proposed construction of a data cache reuse. Multi-block SPRAM to cross cached images and the on-chip ping-pong operation takes full advantage of the data convolution calculation reuse, design a new ASIC data scheduling scheme and overall architecture. Experimental results show that the structure can achieve 40× 32 size of template real-time convolution operations, and improve the utilization of on-chip memory bandwidth and on-chip memory resources, the experimental results show that the structure satisfies the conditions to maximize data throughput output , reducing the need for off-chip memory bandwidth.
Low latency and persistent data storage
Fitch, Blake G; Franceschini, Michele M; Jagmohan, Ashish; Takken, Todd
2014-11-04
Persistent data storage is provided by a computer program product that includes computer program code configured for receiving a low latency store command that includes write data. The write data is written to a first memory device that is implemented by a nonvolatile solid-state memory technology characterized by a first access speed. It is acknowledged that the write data has been successfully written to the first memory device. The write data is written to a second memory device that is implemented by a volatile memory technology. At least a portion of the data in the first memory device is written to a third memory device when a predetermined amount of data has been accumulated in the first memory device. The third memory device is implemented by a nonvolatile solid-state memory technology characterized by a second access speed that is slower than the first access speed.
More than a feeling: Emotional cues impact the access and experience of autobiographical memories.
Sheldon, Signy; Donahue, Julia
2017-07-01
Remembering is impacted by several factors of retrieval, including the emotional content of a memory cue. Here we tested how musical retrieval cues that differed on two dimensions of emotion-valence (positive and negative) and arousal (high and low)-impacted the following aspects of autobiographical memory recall: the response time to access a past personal event, the experience of remembering (ratings of memory vividness), the emotional content of a cued memory (ratings of event arousal and valence), and the type of event recalled (ratings of event energy, socialness, and uniqueness). We further explored how cue presentation affected autobiographical memory retrieval by administering cues of similar arousal and valence levels in a blocked fashion to one half of the tested participants, and randomly to the other half. We report three main findings. First, memories were accessed most quickly in response to musical cues that were highly arousing and positive in emotion. Second, we observed a relation between a cue and the elicited memory's emotional valence but not arousal; however, both the cue valence and arousal related to the nature of the recalled event. Specifically, high cue arousal led to lower memory vividness and uniqueness ratings, but cues with both high arousal and positive valence were associated with memories rated as more social and energetic. Finally, cue presentation impacted both how quickly and specifically memories were accessed and how cue valence affected the memory vividness ratings. The implications of these findings for views of how emotion directs the access to memories and the experience of remembering are discussed.
A GaAs vector processor based on parallel RISC microprocessors
NASA Astrophysics Data System (ADS)
Misko, Tim A.; Rasset, Terry L.
A vector processor architecture based on the development of a 32-bit microprocessor using gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and necessary memory interface functions. This architecture has been simulated for several benchmark programs including complex fast Fourier transform (FFT), complex inner product, trigonometric functions, and sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT at a speed of 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
Low Power LDPC Code Decoder Architecture Based on Intermediate Message Compression Technique
NASA Astrophysics Data System (ADS)
Shimizu, Kazunori; Togawa, Nozomu; Ikenaga, Takeshi; Goto, Satoshi
Reducing the power dissipation for LDPC code decoder is a major challenging task to apply it to the practical digital communication systems. In this paper, we propose a low power LDPC code decoder architecture based on an intermediate message-compression technique which features as follows: (i) An intermediate message compression technique enables the decoder to reduce the required memory capacity and write power dissipation. (ii) A clock gated shift register based intermediate message memory architecture enables the decoder to decompress the compressed messages in a single clock cycle while reducing the read power dissipation. The combination of the above two techniques enables the decoder to reduce the power dissipation while keeping the decoding throughput. The simulation results show that the proposed architecture improves the power efficiency up to 52% and 18% compared to that of the decoder based on the overlapped schedule and the rapid convergence schedule without the proposed techniques respectively.
ERIC Educational Resources Information Center
Acheson, Daniel J.; MacDonald, Maryellen C.
2009-01-01
Many accounts of working memory posit specialized storage mechanisms for the maintenance of serial order. We explore an alternative, that maintenance is achieved through temporary activation in the language production architecture. Four experiments examined the extent to which the phonological similarity effect can be explained as a sublexical…
Cache write generate for parallel image processing on shared memory architectures.
Wittenbrink, C M; Somani, A K; Chen, C H
1996-01-01
We investigate cache write generate, our cache mode invention. We demonstrate that for parallel image processing applications, the new mode improves main memory bandwidth, CPU efficiency, cache hits, and cache latency. We use register level simulations validated by the UW-Proteus system. Many memory, cache, and processor configurations are evaluated.
Price, John M.; Colflesh, Gregory J. H.; Cerella, John; Verhaeghen, Paul
2014-01-01
We investigated the effects of 10 hours of practice on variations of the N-Back task to investigate the processes underlying possible expansion of the focus of attention within working memory. Using subtractive logic, we showed that random access (i.e., Sternberg-like search) yielded a modest effect (a 50% increase in speed) whereas the processes of forward access (i.e., retrieval in order, as in a standard N-Back task) and updating (i.e., changing the contents of working memory) were executed about 5 times faster after extended practice. We additionally found that extended practice increased working memory capacity as measured by the size of the focus of attention for the forward-access task, but not for variations where probing was in random order. This suggests that working memory capacity may depend on the type of search process engaged, and that certain working-memory-related cognitive processes are more amenable to practice than others. PMID:24486803
How intention and monitoring your thoughts influence characteristics of autobiographical memories.
Barzykowski, Krystian; Staugaard, Søren Risløv
2018-05-01
Involuntary autobiographical memories come to mind effortlessly and unintended, but the mechanisms of their retrieval are not fully understood. We hypothesize that involuntary retrieval depends on memories that are highly accessible (e.g., intense, unusual, recent, rehearsed), while the elaborate search that characterizes voluntary retrieval also produces memories that are mundane, repeated or distant - memories with low accessibility. Previous research provides some evidence for this 'threshold hypothesis'. However, in almost every prior study, participants have been instructed to report only memories while ignoring other thoughts. It is possible that such an instruction can modify the phenomenological characteristics of involuntary memories. This study aimed to investigate the effects of retrieval intentionality (i.e., wanting to retrieve a memory) and selective monitoring (i.e., instructions to report only memories) on the phenomenology of autobiographical memories. Participants were instructed to (1) intentionally retrieve autobiographical memories, (2) intentionally retrieve any type of thought (3) wait for an autobiographical memory to spontaneously appear, or (4) wait for any type of thought to spontaneously appear. They rated the mental content on a number of phenomenological characteristics both during retrieval and retrospectively following retrieval. The results support the prediction that highly accessible memories mostly enter awareness unintended and without selective monitoring, while memories with low accessibility rely on intention and selective monitoring. We discuss the implications of these effects. © 2017 The British Psychological Society.
Electrochromic conductive polymer fuses for hybrid organic/inorganic semiconductor memories
NASA Astrophysics Data System (ADS)
Möller, Sven; Forrest, Stephen R.; Perlov, Craig; Jackson, Warren; Taussig, Carl
2003-12-01
We demonstrate a nonvolatile, write-once-read-many-times (WORM) memory device employing a hybrid organic/inorganic semiconductor architecture consisting of thin film p-i-n silicon diode on a stainless steel substrate integrated in series with a conductive polymer fuse. The nonlinearity of the silicon diodes enables a passive matrix memory architecture, while the conductive polyethylenedioxythiophene:polystyrene sulfonic acid polymer serves as a reliable switch with fuse-like behavior for data storage. The polymer can be switched at ˜2 μs, resulting in a permanent decrease of conductivity of the memory pixel by up to a factor of 103. The switching mechanism is primarily due to a current and thermally dependent redox reaction in the polymer, limited by the double injection of both holes and electrons. The switched device performance does not degrade after many thousand read cycles in ambient at room temperature. Our results suggest that low cost, organic/inorganic WORM memories are feasible for light weight, high density, robust, and fast archival storage applications.
Remote Control and Monitoring of VLBI Experiments by Smartphones
NASA Astrophysics Data System (ADS)
Ruztort, C. H.; Hase, H.; Zapata, O.; Pedreros, F.
2012-12-01
For the remote control and monitoring of VLBI operations, we developed a software optimized for smartphones. This is a new tool based on a client-server architecture with a Web interface optimized for smartphone screens and cellphone networks. The server uses variables of the Field System and its station specific parameters stored in the shared memory. The client running on the smartphone by a Web interface analyzes and visualizes the current status of the radio telescope, receiver, schedule, and recorder. In addition, it allows commands to be sent remotely to the Field System computer and displays the log entries. The user has full access to the entire operation process, which is important in emergency cases. The software also integrates a webcam interface.
NASA Astrophysics Data System (ADS)
Tsai, Chih-Wei; Lo, Yu-Lung; Chang, Chia-Chen; Liu, Han-Ying; Yang, Wei-Bin; Cheng, Kuo-Hsing
2017-04-01
A synchronous and highly accurate all-digital duty-cycle corrector (ADDCC), which uses simplified dual-loop architecture, is presented in this paper. To explain the operational principle, a detailed circuit description and formula derivation are provided. To verify the proposed design, a chip was fabricated through the 0.18-µm standard complementary metal oxide semiconductor process with a core area of 0.091 mm2. The measurement results indicate that the proposed ADDCC can operate between 300 and 600 MHz with an input duty-cycle range of 40-60%, and that the output duty-cycle error is less than 1% with a root-mean-square jitter of 3.86 ps.
NASA Astrophysics Data System (ADS)
Weigel, Martin
2011-09-01
Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
NASA Astrophysics Data System (ADS)
Mohammad, Atif Farid; Straub, Jeremy
2015-05-01
A multi-craft asteroid survey has significant data synchronization needs. Limited communication speeds drive exacting performance requirements. Tables have been used in Relational Databases, which are structure; however, DOMBA (Distributed Objects Management Based Articulation) deals with data in terms of collections. With this, no read/write roadblocks to the data exist. A master/slave architecture is created by utilizing the Gossip protocol. This facilitates expanding a mission that makes an important discovery via the launch of another spacecraft. The Open Space Box Framework facilitates the foregoing while also providing a virtual caching layer to make sure that continuously accessed data is available in memory and that, upon closing the data file, recharging is applied to the data.
Multistage WDM access architecture employing cascaded AWGs
NASA Astrophysics Data System (ADS)
El-Nahal, F. I.; Mears, R. J.
2009-03-01
Here we propose passive/active arrayed waveguide gratings (AWGs) with enhanced performance for system applications mainly in novel access architectures employing cascaded AWG technology. Two technologies were considered to achieve space wavelength switching in these networks. Firstly, a passive AWG with semiconductor optical amplifiers array, and secondly, an active AWG. Active AWG is an AWG with an array of phase modulators on its arrayed-waveguides section, where a programmable linear phase-profile or a phase hologram is applied across the arrayed-waveguide section. This results in a wavelength shift at the output section of the AWG. These architectures can address up to 6912 customers employing only 24 wavelengths, coarsely separated by 1.6 nm. Simulation results obtained here demonstrate that cascaded AWGs access architectures have a great potential in future local area networks. Furthermore, they indicate for the first time that active AWGs architectures are more efficient in routing signals to the destination optical network units than passive AWG architectures.
Array processor architecture connection network
NASA Technical Reports Server (NTRS)
Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)
1982-01-01
A connection network is disclosed for use between a parallel array of processors and a parallel array of memory modules for establishing non-conflicting data communications paths between requested memory modules and requesting processors. The connection network includes a plurality of switching elements interposed between the processor array and the memory modules array in an Omega networking architecture. Each switching element includes a first and a second processor side port, a first and a second memory module side port, and control logic circuitry for providing data connections between the first and second processor ports and the first and second memory module ports. The control logic circuitry includes strobe logic for examining data arriving at the first and the second processor ports to indicate when the data arriving is requesting data from a requesting processor to a requested memory module. Further, connection circuitry is associated with the strobe logic for examining requesting data arriving at the first and the second processor ports for providing a data connection therefrom to the first and the second memory module ports in response thereto when the data connection so provided does not conflict with a pre-established data connection currently in use.
NASA Technical Reports Server (NTRS)
Tick, Evan
1987-01-01
This note describes an efficient software emulator for the Warren Abstract Machine (WAM) Prolog architecture. The version of the WAM implemented is called Lcode. The Lcode emulator, written in C, executes the 'naive reverse' benchmark at 3900 LIPS. The emulator is one of a set of tools used to measure the memory-referencing characteristics and performance of Prolog programs. These tools include a compiler, assembler, and memory simulators. An overview of the Lcode architecture is given here, followed by a description and listing of the emulator code implementing each Lcode instruction. This note will be of special interest to those studying the WAM and its performance characteristics. In general, this note will be of interest to those creating efficient software emulators for abstract machine architectures.
A non-destructive crossbar architecture of multi-level memory-based resistor
NASA Astrophysics Data System (ADS)
Sahebkarkhorasani, Seyedmorteza
Nowadays, researchers are trying to shrink the memory cell in order to increase the capacity of the memory system and reduce the hardware costs. In recent years, there has been a revolution in electronics by using fundamentals of physics to build a new memory for computer application in order to increase the capacity and decrease the power consumption. Increasing the capacity of the memory causes a growth in the chip area. From 1971 to 2012 semiconductor manufacturing process improved from 6mum to 22 mum. In May 2008, S.Williams stated that "it is time to stop shrinking". In his paper, he declared that the process of shrinking memory element has recently become very slow and it is time to use another alternative in order to create memory elements [9]. In this project, we present a new design of a memory array using the new element named Memristor [3]. Memristor is a two-terminal passive electrical element that relates the charge and magnetic flux to each other. The device remained unknown since 1971 when it was discovered by Chua and introduced as the fourth fundamental passive element like capacitor, inductor and resistor [3]. Memristor has a dynamic resistance and it can retain its previous value even after disconnecting the power supply. Due to this interesting behavior of the Memristor, it can be a good replacement for all of the Non-Volatile Memories (NVMs) in the near future. Combination of this newly introduced element with the nanowire crossbar architecture would be a great structure which is called Crossbar Memristor. Some frameworks have recently been introduced in literature that utilized Memristor crossbar array, but there are many challenges to implement the Memristor crossbar array due to fabrication and device limitations. In this work, we proposed a simple design of Memristor crossbar array architecture which uses input feedback in order to preserve its data after each read operation.
Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Williams, Samuel; Carter, Jonathan; Oliker, Leonid
2008-02-01
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single core Intel Itanium2. Rather than hand-tuning LBMHDmore » for each system, we develop a code generator that allows us identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14x improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.« less
Lattice Boltzmann simulation optimization on leading multicore platforms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Williams, S.; Carter, J.; Oliker, L.
2008-01-01
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of searchbased performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single core Intel Itanium2. Rather than hand-tuning LBMHDmore » for each system, we develop a code generator that allows us identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our autotuned LBMHD application achieves up to a 14 improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.« less
High-Throughput, Adaptive FFT Architecture for FPGA-Based Spaceborne Data Processors
NASA Technical Reports Server (NTRS)
NguyenKobayashi, Kayla; Zheng, Jason X.; He, Yutao; Shah, Biren N.
2011-01-01
Exponential growth in microelectronics technology such as field-programmable gate arrays (FPGAs) has enabled high-performance spaceborne instruments with increasing onboard data processing capabilities. As a commonly used digital signal processing (DSP) building block, fast Fourier transform (FFT) has been of great interest in onboard data processing applications, which needs to strike a reasonable balance between high-performance (throughput, block size, etc.) and low resource usage (power, silicon footprint, etc.). It is also desirable to be designed so that a single design can be reused and adapted into instruments with different requirements. The Multi-Pass Wide Kernel FFT (MPWK-FFT) architecture was developed, in which the high-throughput benefits of the parallel FFT structure and the low resource usage of Singleton s single butterfly method is exploited. The result is a wide-kernel, multipass, adaptive FFT architecture. The 32K-point MPWK-FFT architecture includes 32 radix-2 butterflies, 64 FIFOs to store the real inputs, 64 FIFOs to store the imaginary inputs, complex twiddle factor storage, and FIFO logic to route the outputs to the correct FIFO. The inputs are stored in sequential fashion into the FIFOs, and the outputs of each butterfly are sequentially written first into the even FIFO, then the odd FIFO. Because of the order of the outputs written into the FIFOs, the depth of the even FIFOs, which are 768 each, are 1.5 times larger than the odd FIFOs, which are 512 each. The total memory needed for data storage, assuming that each sample is 36 bits, is 2.95 Mbits. The twiddle factors are stored in internal ROM inside the FPGA for fast access time. The total memory size to store the twiddle factors is 589.9Kbits. This FFT structure combines the benefits of high throughput from the parallel FFT kernels and low resource usage from the multi-pass FFT kernels with desired adaptability. Space instrument missions that need onboard FFT capabilities such as the proposed DESDynl, SWOT (Surface Water Ocean Topography), and Europa sounding radar missions would greatly benefit from this technology with significant reductions in non-recurring cost and risk.
NASA Astrophysics Data System (ADS)
Baumann, Erwin W.; Williams, David L.
1993-08-01
Artificial neural networks capable of learning and recalling stochastic associations between non-deterministic quantities have received relatively little attention to date. One potential application of such stochastic associative networks is the generation of sensory 'expectations' based on arbitrary subsets of sensor inputs to support anticipatory and investigate behavior in sensor-based robots. Another application of this type of associative memory is the prediction of how a scene will look in one spectral band, including noise, based upon its appearance in several other wavebands. This paper describes a semi-supervised neural network architecture composed of self-organizing maps associated through stochastic inter-layer connections. This 'Stochastic Associative Memory' (SAM) can learn and recall non-deterministic associations between multi-dimensional probability density functions. The stochastic nature of the network also enables it to represent noise distributions that are inherent in any true sensing process. The SAM architecture, training process, and initial application to sensor image prediction are described. Relationships to Fuzzy Associative Memory (FAM) are discussed.
Architecture of fluid intelligence and working memory revealed by lesion mapping.
Barbey, Aron K; Colom, Roberto; Paul, Erick J; Grafman, Jordan
2014-03-01
Although cognitive neuroscience has made valuable progress in understanding the role of the prefrontal cortex in human intelligence, the functional networks that support adaptive behavior and novel problem solving remain to be well characterized. Here, we studied 158 human brain lesion patients to investigate the cognitive and neural foundations of key competencies for fluid intelligence and working memory. We administered a battery of neuropsychological tests, including the Wechsler Adult Intelligence Scale (WAIS) and the N-Back task. Latent variable modeling was applied to obtain error-free scores of fluid intelligence and working memory, followed by voxel-based lesion-symptom mapping to elucidate their neural substrates. The observed latent variable modeling and lesion results support an integrative framework for understanding the architecture of fluid intelligence and working memory and make specific recommendations for the interpretation and application of the WAIS and N-Back task to the study of fluid intelligence in health and disease.
UPC++ Programmer’s Guide (v1.0 2017.9)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bachan, J.; Baden, S.; Bonachea, D.
UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, allmore » operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.« less
UPC++ Programmer’s Guide, v1.0-2018.3.0
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bachan, J.; Baden, S.; Bonachea, Dan
UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operationsmore » that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.« less
NASA Astrophysics Data System (ADS)
Speidel, Steven
1992-08-01
Our ultimate goal is to develop neural-like cognitive sensory processing within non-neuronal systems. Toward this end, computational models are being developed for selectivity attending the task-relevant parts of composite sensory excitations in an example sound processing application. Significant stimuli partials are selectively attended through the use of generalized neural adaptive beamformers. Computational components are being tested by experiment in the laboratory and also by use of recordings from sensor deployments in the ocean. Results will be presented. These computational components are being integrated into a comprehensive processing architecture that simultaneously attends memory according to stimuli, attends stimuli according to memory, and attends stimuli and memory according to an ongoing thought process. The proposed neural architecture is potentially very fast when implemented in special hardware.
Efficient Synthesis of Graph Methods: a Dynamically Scheduled Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Minutoli, Marco; Castellana, Vito G.; Tumeo, Antonino
RDF databases naturally map to a graph representation and employ languages, such as SPARQL, that implements queries as graph pattern matching routines. Graph methods exhibit an irregular behavior: they present unpredictable, fine-grained data accesses, and are synchronization inten- sive. Graph data structures expose large amounts of dy- namic parallelism, but are difficult to partition without gen- erating load unbalance. In this paper, we present a novel ar- chitecture to improve the synthesis of graph methods. Our design addresses the issues of these algorithms with two com- ponents: a Dynamic Task Scheduler (DTS), which reduces load unbalance and maximize resource utilization,more » and a Hi- erarchical Memory Interface controller (HMI), which pro- vides support for concurrent memory operations on multi- ported/multi-banked shared memories. We evaluate our ap- proach by generating the accelerators for a set of SPARQL queries from the Lehigh University Benchmark (LUBM). We first analyze the load unbalance of these queries, showing that execution time among tasks can differ even of order of magnitudes. We then synthesize the queries and com- pare the performance of the resulting accelerators against the current state of the art. Experimental results show that our solution provides a speedup over the serial implementa- tion close to the theoretical maximum and a speedup up to 3.45 over a baseline parallel implementation. We conclude our study by exploring the design space to achieve maximum memory channels utilization. The best design used at least three of the four memory channels for more than 90% of the execution time.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gala, Alan; Ohmacht, Martin
A multiprocessor system includes nodes. Each node includes a data path that includes a core, a TLB, and a first level cache implementing disambiguation. The system also includes at least one second level cache and a main memory. For thread memory access requests, the core uses an address associated with an instruction format of the core. The first level cache uses an address format related to the size of the main memory plus an offset corresponding to hardware thread meta data. The second level cache uses a physical main memory address plus software thread meta data to store the memorymore » access request. The second level cache accesses the main memory using the physical address with neither the offset nor the thread meta data after resolving speculation. In short, this system includes mapping of a virtual address to a different physical addresses for value disambiguation for different threads.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Braiman, Yehuda; Neschke, Brendan; Nair, Niketh S.
Here, we study memory states of a circuit consisting of a small inductively coupled Josephson junction array and introduce basic (write, read, and reset) memory operations logics of the circuit. The presented memory operation paradigm is fundamentally different from conventional single quantum flux operation logics. We calculate stability diagrams of the zero-voltage states and outline memory states of the circuit. We also calculate access times and access energies for basic memory operations.
Architecture for robot intelligence
NASA Technical Reports Server (NTRS)
Peters, II, Richard Alan (Inventor)
2004-01-01
An architecture for robot intelligence enables a robot to learn new behaviors and create new behavior sequences autonomously and interact with a dynamically changing environment. Sensory information is mapped onto a Sensory Ego-Sphere (SES) that rapidly identifies important changes in the environment and functions much like short term memory. Behaviors are stored in a DBAM that creates an active map from the robot's current state to a goal state and functions much like long term memory. A dream state converts recent activities stored in the SES and creates or modifies behaviors in the DBAM.
Content-addressable read/write memories for image analysis
NASA Technical Reports Server (NTRS)
Snyder, W. E.; Savage, C. D.
1982-01-01
The commonly encountered image analysis problems of region labeling and clustering are found to be cases of search-and-rename problem which can be solved in parallel by a system architecture that is inherently suitable for VLSI implementation. This architecture is a novel form of content-addressable memory (CAM) which provides parallel search and update functions, allowing speed reductions down to constant time per operation. It has been proposed in related investigations by Hall (1981) that, with VLSI, CAM-based structures with enhanced instruction sets for general purpose processing will be feasible.
Is random access memory random?
NASA Technical Reports Server (NTRS)
Denning, P. J.
1986-01-01
Most software is contructed on the assumption that the programs and data are stored in random access memory (RAM). Physical limitations on the relative speeds of processor and memory elements lead to a variety of memory organizations that match processor addressing rate with memory service rate. These include interleaved and cached memory. A very high fraction of a processor's address requests can be satified from the cache without reference to the main memory. The cache requests information from main memory in blocks that can be transferred at the full memory speed. Programmers who organize algorithms for locality can realize the highest performance from these computers.
Effects of cacheing on multitasking efficiency and programming strategy on an ELXSI 6400
DOE Office of Scientific and Technical Information (OSTI.GOV)
Montry, G.R.; Benner, R.E.
1985-12-01
The impact of a cache/shared memory architecture, and, in particular, the cache coherency problem, upon concurrent algorithm and program development is discussed. In this context, a simple set of programming strategies are proposed which streamline code development and improve code performance when multitasking in a cache/shared memory or distributed memory environment.
Advanced computer architecture specification for automated weld systems
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
This report describes the requirements for an advanced automated weld system and the associated computer architecture, and defines the overall system specification from a broad perspective. According to the requirements of welding procedures as they relate to an integrated multiaxis motion control and sensor architecture, the computer system requirements are developed based on a proven multiple-processor architecture with an expandable, distributed-memory, single global bus architecture, containing individual processors which are assigned to specific tasks that support sensor or control processes. The specified architecture is sufficiently flexible to integrate previously developed equipment, be upgradable and allow on-site modifications.
NASA Technical Reports Server (NTRS)
Kanerva, P.
1986-01-01
To determine the relation of the sparse, distributed memory to other architectures, a broad review of the literature was made. The memory is called a pattern memory because they work with large patterns of features (high-dimensional vectors). A pattern is stored in a pattern memory by distributing it over a large number of storage elements and by superimposing it over other stored patterns. A pattern is retrieved by mathematical or statistical reconstruction from the distributed elements. Three pattern memories are discussed.
Investigation of fast initialization of spacecraft bubble memory systems
NASA Technical Reports Server (NTRS)
Looney, K. T.; Nichols, C. D.; Hayes, P. J.
1984-01-01
Bubble domain technology offers significant improvement in reliability and functionality for spacecraft onboard memory applications. In considering potential memory systems organizations, minimization of power in high capacity bubble memory systems necessitates the activation of only the desired portions of the memory. In power strobing arbitrary memory segments, a capability of fast turn on is required. Bubble device architectures, which provide redundant loop coding in the bubble devices, limit the initialization speed. Alternate initialization techniques are investigated to overcome this design limitation. An initialization technique using a small amount of external storage is demonstrated.
Workflow of the Grover algorithm simulation incorporating CUDA and GPGPU
NASA Astrophysics Data System (ADS)
Lu, Xiangwen; Yuan, Jiabin; Zhang, Weiwei
2013-09-01
The Grover quantum search algorithm, one of only a few representative quantum algorithms, can speed up many classical algorithms that use search heuristics. No true quantum computer has yet been developed. For the present, simulation is one effective means of verifying the search algorithm. In this work, we focus on the simulation workflow using a compute unified device architecture (CUDA). Two simulation workflow schemes are proposed. These schemes combine the characteristics of the Grover algorithm and the parallelism of general-purpose computing on graphics processing units (GPGPU). We also analyzed the optimization of memory space and memory access from this perspective. We implemented four programs on CUDA to evaluate the performance of schemes and optimization. Through experimentation, we analyzed the organization of threads suited to Grover algorithm simulations, compared the storage costs of the four programs, and validated the effectiveness of optimization. Experimental results also showed that the distinguished program on CUDA outperformed the serial program of libquantum on a CPU with a speedup of up to 23 times (12 times on average), depending on the scale of the simulation.
Mironov, Vladimir; Moskovsky, Alexander; D’Mello, Michael; ...
2017-10-04
The Hartree-Fock (HF) method in the quantum chemistry package GAMESS represents one of the most irregular algorithms in computation today. Major steps in the calculation are the irregular computation of electron repulsion integrals (ERIs) and the building of the Fock matrix. These are the central components of the main Self Consistent Field (SCF) loop, the key hotspot in Electronic Structure (ES) codes. By threading the MPI ranks in the official release of the GAMESS code, we not only speed up the main SCF loop (4x to 6x for large systems), but also achieve a significant (>2x) reduction in the overallmore » memory footprint. These improvements are a direct consequence of memory access optimizations within the MPI ranks. We benchmark our implementation against the official release of the GAMESS code on the Intel R Xeon PhiTM supercomputer. Here, scaling numbers are reported on up to 7,680 cores on Intel Xeon Phi coprocessors.« less
Programmable computing with a single magnetoresistive element
NASA Astrophysics Data System (ADS)
Ney, A.; Pampuch, C.; Koch, R.; Ploog, K. H.
2003-10-01
The development of transistor-based integrated circuits for modern computing is a story of great success. However, the proved concept for enhancing computational power by continuous miniaturization is approaching its fundamental limits. Alternative approaches consider logic elements that are reconfigurable at run-time to overcome the rigid architecture of the present hardware systems. Implementation of parallel algorithms on such `chameleon' processors has the potential to yield a dramatic increase of computational speed, competitive with that of supercomputers. Owing to their functional flexibility, `chameleon' processors can be readily optimized with respect to any computer application. In conventional microprocessors, information must be transferred to a memory to prevent it from getting lost, because electrically processed information is volatile. Therefore the computational performance can be improved if the logic gate is additionally capable of storing the output. Here we describe a simple hardware concept for a programmable logic element that is based on a single magnetic random access memory (MRAM) cell. It combines the inherent advantage of a non-volatile output with flexible functionality which can be selected at run-time to operate as an AND, OR, NAND or NOR gate.
ERIC Educational Resources Information Center
Brooks, Kenneth W.
The guide details characteristics to provide architecturally accessible special education programs for handicapped students. Impetus for the accessibility movement is traced to legislation, including the Architectural Barriers Act and Sections 503 and 504 of the Rehabilitation Act of 1973. Planning features considered are the development of a…
A generalized LSTM-like training algorithm for second-order recurrent neural networks
Monner, Derek; Reggia, James A.
2011-01-01
The Long Short Term Memory (LSTM) is a second-order recurrent neural network architecture that excels at storing sequential short-term memories and retrieving them many time-steps later. LSTM’s original training algorithm provides the important properties of spatial and temporal locality, which are missing from other training approaches, at the cost of limiting it’s applicability to a small set of network architectures. Here we introduce the Generalized Long Short-Term Memory (LSTM-g) training algorithm, which provides LSTM-like locality while being applicable without modification to a much wider range of second-order network architectures. With LSTM-g, all units have an identical set of operating instructions for both activation and learning, subject only to the configuration of their local environment in the network; this is in contrast to the original LSTM training algorithm, where each type of unit has its own activation and training instructions. When applied to LSTM architectures with peephole connections, LSTM-g takes advantage of an additional source of back-propagated error which can enable better performance than the original algorithm. Enabled by the broad architectural applicability of LSTM-g, we demonstrate that training recurrent networks engineered for specific tasks can produce better results than single-layer networks. We conclude that LSTM-g has the potential to both improve the performance and broaden the applicability of spatially and temporally local gradient-based training algorithms for recurrent neural networks. PMID:21803542
A mission operations architecture for the 21st century
NASA Technical Reports Server (NTRS)
Tai, W.; Sweetnam, D.
1996-01-01
An operations architecture is proposed for low cost missions beyond the year 2000. The architecture consists of three elements: a service based architecture; a demand access automata; and distributed science hubs. The service based architecture is based on a set of standard multimission services that are defined, packaged and formalized by the deep space network and the advanced multi-mission operations system. The demand access automata is a suite of technologies which reduces the need to be in contact with the spacecraft, and thus reduces operating costs. The beacon signaling, the virtual emergency room, and the high efficiency tracking automata technologies are described. The distributed science hubs provide information system capabilities to the small science oriented flight teams: individual access to all traditional mission functions and services; multimedia intra-team communications, and automated direct transparent communications between the scientists and the instrument.
Price, John M; Colflesh, Gregory J H; Cerella, John; Verhaeghen, Paul
2014-05-01
We investigated the effects of 10h of practice on variations of the N-Back task to investigate the processes underlying possible expansion of the focus of attention within working memory. Using subtractive logic, we showed that random access (i.e., Sternberg-like search) yielded a modest effect (a 50% increase in speed) whereas the processes of forward access (i.e., retrieval in order, as in a standard N-Back task) and updating (i.e., changing the contents of working memory) were executed about 5 times faster after extended practice. We additionally found that extended practice increased working memory capacity as measured by the size of the focus of attention for the forward-access task, but not for variations where probing was in random order. This suggests that working memory capacity may depend on the type of search process engaged, and that certain working-memory-related cognitive processes are more amenable to practice than others. Copyright © 2014 Elsevier B.V. All rights reserved.
On the Efficacy of Source Code Optimizations for Cache-Based Systems
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.; Saphir, William C.
1998-01-01
Obtaining high performance without machine-specific tuning is an important goal of scientific application programmers. Since most scientific processing is done on commodity microprocessors with hierarchical memory systems, this goal of "portable performance" can be achieved if a common set of optimization principles is effective for all such systems. It is widely believed, or at least hoped, that portable performance can be realized. The rule of thumb for optimization on hierarchical memory systems is to maximize temporal and spatial locality of memory references by reusing data and minimizing memory access stride. We investigate the effects of a number of optimizations on the performance of three related kernels taken from a computational fluid dynamics application. Timing the kernels on a range of processors, we observe an inconsistent and often counterintuitive impact of the optimizations on performance. In particular, code variations that have a positive impact on one architecture can have a negative impact on another, and variations expected to be unimportant can produce large effects. Moreover, we find that cache miss rates - as reported by a cache simulation tool, and confirmed by hardware counters - only partially explain the results. By contrast, the compiler-generated assembly code provides more insight by revealing the importance of processor-specific instructions and of compiler maturity, both of which strongly, and sometimes unexpectedly, influence performance. We conclude that it is difficult to obtain performance portability on modern cache-based computers, and comment on the implications of this result.
On the Efficacy of Source Code Optimizations for Cache-Based Systems
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.; Saphir, William C.; Saini, Subhash (Technical Monitor)
1998-01-01
Obtaining high performance without machine-specific tuning is an important goal of scientific application programmers. Since most scientific processing is done on commodity microprocessors with hierarchical memory systems, this goal of "portable performance" can be achieved if a common set of optimization principles is effective for all such systems. It is widely believed, or at least hoped, that portable performance can be realized. The rule of thumb for optimization on hierarchical memory systems is to maximize temporal and spatial locality of memory references by reusing data and minimizing memory access stride. We investigate the effects of a number of optimizations on the performance of three related kernels taken from a computational fluid dynamics application. Timing the kernels on a range of processors, we observe an inconsistent and often counterintuitive impact of the optimizations on performance. In particular, code variations that have a positive impact on one architecture can have a negative impact on another, and variations expected to be unimportant can produce large effects. Moreover, we find that cache miss rates-as reported by a cache simulation tool, and confirmed by hardware counters-only partially explain the results. By contrast, the compiler-generated assembly code provides more insight by revealing the importance of processor-specific instructions and of compiler maturity, both of which strongly, and sometimes unexpectedly, influence performance. We conclude that it is difficult to obtain performance portability on modern cache-based computers, and comment on the implications of this result.
NASA Technical Reports Server (NTRS)
Hendry, David F. (Inventor)
1993-01-01
In a data system having a memory, plural input/output (I/O) devices and a bus connecting each of the I/O devices to the memory, a direct memory access (DMA) controller regulating access of each of the I/O devices to the bus, including a priority register storing priorities of bus access requests from the I/O devices, an interrupt register storing bus access requests of the I/O devices, a resolver for selecting one of the I/O devices to have access to the bus, a pointer register storing addresses of locations in the memory for communication with the one I/O device via the bus, a sequence register storing an address of a location in the memory containing a channel program instruction which is to be executed next, an ALU for incrementing and decrementing addresses stored in the pointer register, computing the next address to be stored in the sequence register, computing an initial contents of each of the register. The memory contains a sequence of channel program instructions defining a set up operation wherein the contents of each of the registers in the channel register is initialized in accordance with the initial contents computed by the ALU and an access operation wherein data is transferred on the bus between a location in the memory whose address is currently stored in the pointer register and the one I/O device enabled by the resolver.
zorder-lib: Library API for Z-Order Memory Layout
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nowell, Lucy; Edward W. Bethel
2015-04-01
This document describes the motivation for, elements of, and use of the zorder-lib, a library API that implements organization of and access to data in memory using either a-order (also known as "row-major" order) or z-order memory layouts. The primary motivation for this work is to improve the performance of many types of data- intensive codes by increasing both spatial and temporal locality of memory accesses. The basic idea is that the cost associated with accessing a datum is less when it is nearby in either space or time.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Murphy, Richard C.
2009-09-01
This report details the accomplishments of the 'Building More Powerful Less Expensive Supercomputers Using Processing-In-Memory (PIM)' LDRD ('PIM LDRD', number 105809) for FY07-FY09. Latency dominates all levels of supercomputer design. Within a node, increasing memory latency, relative to processor cycle time, limits CPU performance. Between nodes, the same increase in relative latency impacts scalability. Processing-In-Memory (PIM) is an architecture that directly addresses this problem using enhanced chip fabrication technology and machine organization. PIMs combine high-speed logic and dense, low-latency, high-bandwidth DRAM, and lightweight threads that tolerate latency by performing useful work during memory transactions. This work examines the potential ofmore » PIM-based architectures to support mission critical Sandia applications and an emerging class of more data intensive informatics applications. This work has resulted in a stronger architecture/implementation collaboration between 1400 and 1700. Additionally, key technology components have impacted vendor roadmaps, and we are in the process of pursuing these new collaborations. This work has the potential to impact future supercomputer design and construction, reducing power and increasing performance. This final report is organized as follow: this summary chapter discusses the impact of the project (Section 1), provides an enumeration of publications and other public discussion of the work (Section 1), and concludes with a discussion of future work and impact from the project (Section 1). The appendix contains reprints of the refereed publications resulting from this work.« less
REUSABLE PROPULSION ARCHITECTURE FOR SUSTAINABLE LOW-COST ACCESS TO SPACE
NASA Technical Reports Server (NTRS)
Bonometti, J. A.; Dankanich, J. W.; Frame, K. L.
2005-01-01
The primary obstacle to any space-based mission is, and has always been, the cost of access to space. Even with impressive efforts toward reusability, no system has come close to lowering the cost a significant amount. It is postulated here, that architectural innovation is necessary to make reusability feasible, not incremental subsystem changes. This paper shows two architectural approaches of reusability that merit further study investments. Both #inherently# have performance increases and cost advantages to make affordable access to space a near term reality. A rocket launched from a subsonic aircraft (specifically the Crossbow methodology) and a momentum exchange tether, reboosted by electrodynamics, offer possibilities of substantial reductions in the total transportation architecture mass - making access-to-space cost-effective. They also offer intangible benefits that reduce risk or offer large growth potential. The cost analysis indicates that approximately a 50% savings is obtained using today#s aerospace materials and practices.
Efficient accesses of data structures using processing near memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jayasena, Nuwan S.; Zhang, Dong Ping; Diez, Paula Aguilera
Systems, apparatuses, and methods for implementing efficient queues and other data structures. A queue may be shared among multiple processors and/or threads without using explicit software atomic instructions to coordinate access to the queue. System software may allocate an atomic queue and corresponding queue metadata in system memory and return, to the requesting thread, a handle referencing the queue metadata. Any number of threads may utilize the handle for accessing the atomic queue. The logic for ensuring the atomicity of accesses to the atomic queue may reside in a management unit in the memory controller coupled to the memory wheremore » the atomic queue is allocated.« less
Programming model for distributed intelligent systems
NASA Technical Reports Server (NTRS)
Sztipanovits, J.; Biegl, C.; Karsai, G.; Bogunovic, N.; Purves, B.; Williams, R.; Christiansen, T.
1988-01-01
A programming model and architecture which was developed for the design and implementation of complex, heterogeneous measurement and control systems is described. The Multigraph Architecture integrates artificial intelligence techniques with conventional software technologies, offers a unified framework for distributed and shared memory based parallel computational models and supports multiple programming paradigms. The system can be implemented on different hardware architectures and can be adapted to strongly different applications.
Influence of cued-fear conditioning and its impairment on NREM sleep.
Kumar, Tankesh; Jha, Sushil K
2017-10-01
Many studies suggest that fear conditioning influences sleep. It is, however, not known if the changes in sleep architecture after fear conditioning are essentially associated with the consolidation of fearful memory or with fear itself. Here, we have observed that within sleep, NREM sleep consistently remained augmented after the consolidation of cued fear-conditioned memory. But a similar change did not occur after impairing memory consolidation by blocking new protein synthesis and glutamate transmission between glial-neuronal loop in the lateral amygdala (LA). Anisomycin (a protein synthesis inhibitor) and DL-α-amino-adipic acid (DL- α -AA) (a glial glutamine synthetase enzyme inhibitor) were microinjected into the LA soon after cued fear-conditioning to induce memory impairment. On the post-conditioning day, animals in both the groups exhibited significantly less freezing. In memory-consolidated groups (vehicle groups), NREM sleep significantly increased during 2nd to 5th hours after training compared to their baseline days. However, in memory impaired groups (anisomycin and DL- α -AA microinjected groups), similar changes were not observed. Our results thus suggest that changes in sleep architecture after cued fear-conditioning are indeed a consolidation dependent event. Copyright © 2017 Elsevier Inc. All rights reserved.
Working memory capacity and controlled serial memory search.
Mızrak, Eda; Öztekin, Ilke
2016-08-01
The speed-accuracy trade-off (SAT) procedure was used to investigate the relationship between working memory capacity (WMC) and the dynamics of temporal order memory retrieval. High- and low-span participants (HSs, LSs) studied sequentially presented five-item lists, followed by two probes from the study list. Participants indicated the more recent probe. Overall, accuracy was higher for HSs compared to LSs. Crucially, in contrast to previous investigations that observed no impact of WMC on speed of access to item information in memory (e.g., Öztekin & McElree, 2010), recovery of temporal order memory was slower for LSs. While accessing an item's representation in memory can be direct, recovery of relational information such as temporal order information requires a more controlled serial memory search. Collectively, these data indicate that WMC effects are particularly prominent during high demands of cognitive control, such as serial search operations necessary to access temporal order information from memory. Copyright © 2016 Elsevier B.V. All rights reserved.
Saying what’s on your mind: Working memory effects on sentence production
Slevc, L. Robert
2011-01-01
The role of working memory (WM) in sentence comprehension has received considerable interest, but little work has investigated how sentence production relies on memory mechanisms. These three experiments investigated speakers’ tendency to produce syntactic structures that allow for early production of material that is accessible in memory. In Experiment 1, speakers produced accessible information early less often when under a verbal WM load than when under no load. Experiment 2 found the same pattern for given-new ordering, i.e., when accessibility was manipulated by making information given. Experiment 3 addressed the possibility that these effects do not reflect WM mechanisms but rather increased task difficulty by relying on the distinction between verbal and spatial WM: Speakers’ tendency to produce sentences respecting given-new ordering was reduced more by a verbal than by a spatial WM load. These patterns show that accessibility effects do in fact reflect accessibility in verbal WM, and that representations in sentence production are vulnerable to interference from other information in memory. PMID:21767058
Working memory at work: how the updating process alters the nature of working memory transfer.
Zhang, Yanmin; Verhaeghen, Paul; Cerella, John
2012-01-01
In three N-Back experiments, we investigated components of the process of working memory (WM) updating, more specifically access to items stored outside the focus of attention and transfer from the focus to the region of WM outside the focus. We used stimulus complexity as a marker. We found that when WM transfer occurred under full attention, it was slow and highly sensitive to stimulus complexity, much more so than WM access. When transfer occurred in conjunction with access, however, it was fast and no longer sensitive to stimulus complexity. Thus the updating context altered the nature of WM processing: The dual-task situation (transfer in conjunction with access) drove memory transfer into a more efficient mode, indifferent to stimulus complexity. In contrast, access times consistently increased with complexity, unaffected by the processing context. This study reinforces recent reports that retrieval is a (perhaps the) key component of working memory functioning. Copyright © 2011 Elsevier B.V. All rights reserved.
Working Memory at Work: How the Updating Process Alters the Nature of Working Memory Transfer
Zhang, Yanmin; Verhaeghen, Paul; Cerella, John
2011-01-01
In three N-Back experiments, we investigated components of the process of working memory (WM) updating, more specifically access to items stored outside the focus of attention and transfer from the focus to the region of WM outside the focus. We used stimulus complexity as a marker. We found that when WM transfer occurred under full attention, it was slow and highly sensitive to stimulus complexity, much more so than WM access. When transfer occurred in conjunction with access, however, it was fast and no longer sensitive to stimulus complexity. Thus the updating context altered the nature of WM processing: The dual-task situation (transfer in conjunction with access) drove memory transfer into a more efficient mode, indifferent to stimulus complexity. In contrast, access times consistently increased with complexity, unaffected by the processing context. This study reinforces recent reports that retrieval is a (perhaps the) key component of working memory functioning. PMID:22105718
ERIC Educational Resources Information Center
Gerhardt, Lillian N.
1981-01-01
Evaluates the Prince George's County Memorial Public Library's approach to providing access to its services for children, and examines policies, regulations, practices, and conditions that affect such access. Six references are cited. (FM)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nelson, Andrew F.; Wetzstein, M.; Naab, T.
2009-10-01
We continue our presentation of VINE. In this paper, we begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the code itself. We continue with a detailed description of a number of optimizations made to the layout of the particle data in memory and to our implementation of a binary tree used to access that data for use in gravitational force calculations and searches for smoothed particle hydrodynamics (SPH) neighbor particles. We describe the modifications to the codemore » necessary to obtain forces efficiently from special purpose 'GRAPE' hardware, the interfaces required to allow transparent substitution of those forces in the code instead of those obtained from the tree, and the modifications necessary to use both tree and GRAPE together as a fused GRAPE/tree combination. We conclude with an extensive series of performance tests, which demonstrate that the code can be run efficiently and without modification in serial on small workstations or in parallel using the OpenMP compiler directives on large-scale, shared memory parallel machines. We analyze the effects of the code optimizations and estimate that they improve its overall performance by more than an order of magnitude over that obtained by many other tree codes. Scaled parallel performance of the gravity and SPH calculations, together the most costly components of most simulations, is nearly linear up to at least 120 processors on moderate sized test problems using the Origin 3000 architecture, and to the maximum machine sizes available to us on several other architectures. At similar accuracy, performance of VINE, used in GRAPE-tree mode, is approximately a factor 2 slower than that of VINE, used in host-only mode. Further optimizations of the GRAPE/host communications could improve the speed by as much as a factor of 3, but have not yet been implemented in VINE. Finally, we find that although parallel performance on small problems may reach a plateau beyond which more processors bring no additional speedup, performance never decreases, a factor important for running large simulations on many processors with individual time steps, where only a small fraction of the total particles require updates at any given moment.« less
Nonvolatile memory chips: critical technology for high-performance recce systems
NASA Astrophysics Data System (ADS)
Kaufman, Bruce
2000-11-01
Airborne recce systems universally require nonvolatile storage of recorded data. Both present and next generation designs make use of flash memory chips. Flash memory devices are in high volume use for a variety of commercial products ranging form cellular phones to digital cameras. Fortunately, commercial applications call for increasing capacities and fast write times. These parameters are important to the designer of recce recorders. Of economic necessity COTS devices are used in recorders that must perform in military avionics environments. Concurrently, recording rates are moving to $GTR10Gb/S. Thus to capture imagery for even a few minutes of record time, tactically meaningful solid state recorders will require storage capacities in the 100s of Gbytes. Even with memory chip densities at present day 512Mb, such capacities require thousands of chips. The demands on packaging technology are daunting. This paper will consider the differing flash chip architectures, both available and projected and discuss the impact on recorder architecture and performance. Emerging nonvolatile memory technologies, FeRAM AND MIRAM will be reviewed with regard to their potential use in recce recorders.
Sheldon, Signy; Chu, Sonja
2017-09-01
Autobiographical memory research has investigated how cueing distinct aspects of a past event can trigger different recollective experiences. This research has stimulated theories about how autobiographical knowledge is accessed and organized. Here, we test the idea that thematic information organizes multiple autobiographical events whereas spatial information organizes individual past episodes by investigating how retrieval guided by these two forms of information differs. We used a novel autobiographical fluency task in which participants accessed multiple memory exemplars to event theme and spatial (location) cues followed by a narrative description task in which they described the memories generated to these cues. Participants recalled significantly more memory exemplars to event theme than to spatial cues; however, spatial cues prompted faster access to past memories. Results from the narrative description task revealed that memories retrieved via event theme cues compared to spatial cues had a higher number of overall details, but those recalled to the spatial cues were recollected with a greater concentration on episodic details than those retrieved via event theme cues. These results provide evidence that thematic information organizes and integrates multiple memories whereas spatial information prompts the retrieval of specific episodic content from a past event.
NASA Astrophysics Data System (ADS)
Wang, Jianhua; Cheng, Lianglun; Wang, Tao; Peng, Xiaodong
2016-03-01
Table look-up operation plays a very important role during the decoding processing of context-based adaptive variable length decoding (CAVLD) in H.264/advanced video coding (AVC). However, frequent table look-up operation can result in big table memory access, and then lead to high table power consumption. Aiming to solve the problem of big table memory access of current methods, and then reduce high power consumption, a memory-efficient table look-up optimized algorithm is presented for CAVLD. The contribution of this paper lies that index search technology is introduced to reduce big memory access for table look-up, and then reduce high table power consumption. Specifically, in our schemes, we use index search technology to reduce memory access by reducing the searching and matching operations for code_word on the basis of taking advantage of the internal relationship among length of zero in code_prefix, value of code_suffix and code_lengh, thus saving the power consumption of table look-up. The experimental results show that our proposed table look-up algorithm based on index search can lower about 60% memory access consumption compared with table look-up by sequential search scheme, and then save much power consumption for CAVLD in H.264/AVC.
Development of Curie point switching for thin film, random access, memory device
NASA Technical Reports Server (NTRS)
Lewicki, G. W.; Tchernev, D. I.
1967-01-01
Managanese bismuthide films are used in the development of a random access memory device of high packing density and nondestructive readout capability. Memory entry is by Curie point switching using a laser beam. Readout is accomplished by microoptical or micromagnetic scanning.
Application of Astronomical Compositions in Small Architectural Forms
NASA Astrophysics Data System (ADS)
Haykazun, Ani
2016-12-01
The small architectural forms are an important part of the Armenian architecture. Their compositions are diverse including quadrihedral structures, cross-stones, monuments, gravestones, memorial stones, etc. From ancient times to the late middle ages, and up to themodern small architectural forms, there are many decorative elements of astronomical character. Among them, one can more often see stars, the sun, the moon, the sky, the planets, the sign of eternity and other symbolic decorative images, which play a major role in the formation of the artistic image of the architectural compositions. The analysis of application of astronomical compositions will help more comprehensively introduce the compositional peculiarities of the small architectural forms.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan
We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors IBM Blue- Gene/Q compute chip and Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list ofmore » architectural scaling options for future processor designs.« less
Tehan, G; Lalor, D M
2000-11-01
Rehearsal speed has traditionally been seen to be the prime determinant of individual differences in memory span. Recent studies, in the main using young children as the subject population, have suggested other contributors to span performance, notably contributions from long-term memory and forgetting and retrieval processes occurring during recall. In the current research we explore individual differences in span with respect to measures of rehearsal, output time, and access to lexical memory. We replicate standard short-term phenomena; we show that the variables that influence children's span performance influence adult performance in the same way; and we show that lexical memory access appears to be a more potent source of individual differences in span than either rehearsal speed or output factors.
Adult Age Differences in Accessing and Retrieving Information from Long-Term Memory.
ERIC Educational Resources Information Center
Petros, Thomas V.; And Others
1983-01-01
Investigated adult age differences in accessing and retrieving information from long-term memory. Results showed that older adults (N=26) were slower than younger adults (N=35) at feature extraction, lexical access, and accessing category information. The age deficit was proportionally greater when retrieval of category information was required.…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fischler, M.
1992-04-01
The issues to be addressed here are those of balance'' in machine architecture. By this, we mean how much emphasis must be placed on various aspects of the system to maximize its usefulness for physics. There are three components that contribute to the utility of a system: How the machine can be used, how big a problem can be attacked, and what the effective capabilities (power) of the hardware are like. The effective power issue is a matter of evaluating the impact of design decisions trading off architectural features such as memory bandwidth and interprocessor communication capabilities. What is studiedmore » is the effect these machine parameters have on how quickly the system can solve desired problems. There is a reasonable method for studying this: One selects a few representative algorithms and computes the impact of changing memory bandwidths, and so forth. The only room for controversy here is in the selection of representative problems. The issue of how big a problem can be attacked boils down to a balance of memory size versus power. Although this is a balance issue it is very different than the effective power situation, because no firm answer can be given at this time. The power to memory ratio is highly problem dependent, and optimizing it requires several pieces of physics input, including: how big a lattice is needed for interesting results; what sort of algorithms are best to use; and how many sweeps are needed to get valid results. We seem to be at the threshold of learning things about these issues, but for now, the memory size issue will necessarily be addressed in terms of best guesses, rules of thumb, and researchers' opinions.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fischler, M.
1992-04-01
The issues to be addressed here are those of ``balance`` in machine architecture. By this, we mean how much emphasis must be placed on various aspects of the system to maximize its usefulness for physics. There are three components that contribute to the utility of a system: How the machine can be used, how big a problem can be attacked, and what the effective capabilities (power) of the hardware are like. The effective power issue is a matter of evaluating the impact of design decisions trading off architectural features such as memory bandwidth and interprocessor communication capabilities. What is studiedmore » is the effect these machine parameters have on how quickly the system can solve desired problems. There is a reasonable method for studying this: One selects a few representative algorithms and computes the impact of changing memory bandwidths, and so forth. The only room for controversy here is in the selection of representative problems. The issue of how big a problem can be attacked boils down to a balance of memory size versus power. Although this is a balance issue it is very different than the effective power situation, because no firm answer can be given at this time. The power to memory ratio is highly problem dependent, and optimizing it requires several pieces of physics input, including: how big a lattice is needed for interesting results; what sort of algorithms are best to use; and how many sweeps are needed to get valid results. We seem to be at the threshold of learning things about these issues, but for now, the memory size issue will necessarily be addressed in terms of best guesses, rules of thumb, and researchers` opinions.« less
The storage and recall of auditory memory.
Nebenzahl, I; Albeck, Y
1990-01-01
The architecture of the auditory memory is investigated. The auditory information is assumed to be represented by f-t patterns. With the help of a psycho-physical experiment it is demonstrated that the storage of these patterns is highly folded in the sense that a long signal is broken into many short stretches before being stored in the memory. Recognition takes place by correlating newly heard input in the short term memory to information previously stored in the long term memory. We show that this correlation is performed after the input is accumulated and held statically in the short term memory.
A Report of Bethune-Cookman College NASA JOVE Projects
NASA Technical Reports Server (NTRS)
Agba, Lawrence C.; David, Sunil K.; Rao, Narsing G.; Rahmani, Munir A.
1997-01-01
This document is the final report for the Joint Venture (JOVE) in Space Sciences, and describes the tasks, performed with the support of the contract. These tasks include work in: (1) interfacing microprocessor systems to high performance parallel interface chips, SCSI drive and memory, needed for the implementation of a Space Optical Data Recorder; (2) designing a digital interface architecture for a microprocessor controlled sensors monitoring unit for a NASA Jitter Attenuation and Dynamics Experiment (JADE) project; (3) developing an enhanced back-propagation training algorithm; (4) studying the effect of simulated spaceflight on Aortic Contractility; (5) developing a course in astronomy; and (6) improving internet access by running cables, and installing hubs in various places on the campus; and (7) researching the characteristics of Nd:YALO laser resonator.
Intuitive Intelligence, Self-regulation, and Lifting Consciousness
Zayas, Maria
2014-01-01
This article explores the role of the heart in emotional experience, as well as how learning to shift the rhythms of the heart into a more coherent state makes it possible to establish a new inner baseline reference that allows access to our heart’s intuitive capacities and deeper wisdom. The nature and types of intuition and the connection between intuition and compassionate action are discussed. It is suggested that increased effectiveness in self-regulatory capacity and the resultant reorganization of memories sustained in the neural architecture facilitates a stable and integrated experience of self in relationship to others and to the environment, otherwise known as consciousness. The implications of meeting the increasingly complex demands of life with greater love, compassion, and kindness, thereby lifting consciousness, are considered. PMID:24808982
Neurocognitive architecture of working memory
Eriksson, Johan; Vogel, Edward K.; Lansner, Anders; Bergström, Fredrik; Nyberg, Lars
2015-01-01
The crucial role of working memory for temporary information processing and guidance of complex behavior has been recognized for many decades. There is emerging consensus that working memory maintenance results from the interactions among long-term memory representations and basic processes, including attention, that are instantiated as reentrant loops between frontal and posterior cortical areas, as well as subcortical structures. The nature of such interactions can account for capacity limitations, lifespan changes, and restricted transfer after working-memory training. Recent data and models indicate that working memory may also be based on synaptic plasticity, and that working memory can operate on non-consciously perceived information. PMID:26447571
Weather prediction using a genetic memory
NASA Technical Reports Server (NTRS)
Rogers, David
1990-01-01
Kanaerva's sparse distributed memory (SDM) is an associative memory model based on the mathematical properties of high dimensional binary address spaces. Holland's genetic algorithms are a search technique for high dimensional spaces inspired by evolutional processes of DNA. Genetic Memory is a hybrid of the above two systems, in which the memory uses a genetic algorithm to dynamically reconfigure its physical storage locations to reflect correlations between the stored addresses and data. This architecture is designed to maximize the ability of the system to scale-up to handle real world problems.
An Novel Architecture of Large-scale Communication in IOT
NASA Astrophysics Data System (ADS)
Ma, Wubin; Deng, Su; Huang, Hongbin
2018-03-01
In recent years, many scholars have done a great deal of research on the development of Internet of Things and networked physical systems. However, few people have made the detailed visualization of the large-scale communications architecture in the IOT. In fact, the non-uniform technology between IPv6 and access points has led to a lack of broad principles of large-scale communications architectures. Therefore, this paper presents the Uni-IPv6 Access and Information Exchange Method (UAIEM), a new architecture and algorithm that addresses large-scale communications in the IOT.
78 FR 49248 - Passenger Vessels Accessibility Guidelines
Federal Register 2010, 2011, 2012, 2013, 2014
2013-08-13
... ARCHITECTURAL AND TRANSPORTATION BARRIERS COMPLIANCE BOARD 36 CFR Part 1196 [Docket No. ATBCB-2013... Barriers Compliance Board. ACTION: Notice of proposed rulemaking; extension of comment period. SUMMARY: The Architectural and Transportation Barriers Compliance Board (Access Board) is extending until January 24, 2014...
On the Design of a Comprehensive Authorisation Framework for Service Oriented Architecture (SOA)
2013-07-01
Authentication Server AZM Authorisation Manager AZS Authorisation Server BP Business Process BPAA Business Process Authorisation Architecture BPAD Business...Internet Protocol Security JAAS Java Authentication and Authorisation Service MAC Mandatory Access Control RBAC Role Based Access Control RCA Regional...the authentication process, make authorisation decisions using application specific access control functions that results in the practice of
NASA Astrophysics Data System (ADS)
Hirono, Masahiko; Nojima, Toshio
This paper presents a new signaling architecture for radio-access control in wireless communications systems. Called THREP (for THREe-phase link set-up Process), it enables systems with low-cost configurations to provide tetherless access and wide-ranging mobility by using autonomous radio-link controls for fast cell searching and distributed call management. A signaling architecture generally consists of a radio-access part and a service-entity-access part. In THREP, the latter part is divided into two steps: preparing a communication channel, and sustaining it. Access control in THREP is thus composed of three separated parts, or protocol phases. The specifications of each phase are determined independently according to system requirements. In the proposed architecture, the first phase uses autonomous radio-link control because we want to construct low-power indoor wireless communications systems. Evaluation of channel usage efficiency and hand-over loss probability in the personal handy-phone system (PHS) shows that THREP makes the radio-access sub-system operations in a practical application model highly efficient, and the results of a field experiment show that THREP provides sufficient protection against severe fast CNR degradation in practical indoor propagation environments.
ERIC Educational Resources Information Center
Gulliford, Andrew
The book examines the one-room schoolhouse and the memories of this important part of the American past through sections on the country school legacy, country school architecture, and country school preservation. The architectural and historical significance of this distinctive building type is evocatively portrayed by more than 400 photographs.…
Radiation Effects of Commercial Resistive Random Access Memories
NASA Technical Reports Server (NTRS)
Chen, Dakai; LaBel, Kenneth A.; Berg, Melanie; Wilcox, Edward; Kim, Hak; Phan, Anthony; Figueiredo, Marco; Buchner, Stephen; Khachatrian, Ani; Roche, Nicolas
2014-01-01
We present results for the single-event effect response of commercial production-level resistive random access memories. We found that the resistive memory arrays are immune to heavy ion-induced upsets. However, the devices were susceptible to single-event functional interrupts, due to upsets from the control circuits. The intrinsic radiation tolerant nature of resistive memory makes the technology an attractive consideration for future space applications.
NASA Technical Reports Server (NTRS)
Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)
1983-01-01
A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occur with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.
Impact of memory bottleneck on the performance of graphics processing units
NASA Astrophysics Data System (ADS)
Son, Dong Oh; Choi, Hong Jun; Kim, Jong Myon; Kim, Cheol Hong
2015-12-01
Recent graphics processing units (GPUs) can process general-purpose applications as well as graphics applications with the help of various user-friendly application programming interfaces (APIs) supported by GPU vendors. Unfortunately, utilizing the hardware resource in the GPU efficiently is a challenging problem, since the GPU architecture is totally different to the traditional CPU architecture. To solve this problem, many studies have focused on the techniques for improving the system performance using GPUs. In this work, we analyze the GPU performance varying GPU parameters such as the number of cores and clock frequency. According to our simulations, the GPU performance can be improved by 125.8% and 16.2% on average as the number of cores and clock frequency increase, respectively. However, the performance is saturated when memory bottleneck problems incur due to huge data requests to the memory. The performance of GPUs can be improved as the memory bottleneck is reduced by changing GPU parameters dynamically.
Sleep-dependent memory consolidation in healthy aging and mild cognitive impairment.
Pace-Schott, Edward F; Spencer, Rebecca M C
2015-01-01
Sleep quality and architecture as well as sleep's homeostatic and circadian controls change with healthy aging. Changes include reductions in slow-wave sleep's (SWS) percent and spectral power in the sleep electroencephalogram (EEG), number and amplitude of sleep spindles, rapid eye movement (REM) density and the amplitude of circadian rhythms, as well as a phase advance (moved earlier in time) of the brain's circadian clock. With mild cognitive impairment (MCI) there are further reductions of sleep quality, SWS, spindles, and percent REM, all of which further diminish, along with a profound disruption of circadian rhythmicity, with the conversion to Alzheimer's disease (AD). Sleep disorders may represent risk factors for dementias (e.g., REM Behavior Disorder presages Parkinson's disease) and sleep disorders are themselves extremely prevalent in neurodegenerative diseases. Working memory , formation of new episodic memories, and processing speed all decline with healthy aging whereas semantic, recognition, and emotional declarative memory are spared. In MCI, episodic and working memory further decline along with declines in semantic memory. In young adults, sleep-dependent memory consolidation (SDC) is widely observed for both declarative and procedural memory tasks. However, with healthy aging, although SDC for declarative memory is preserved, certain procedural tasks, such as motor-sequence learning, do not show SDC. In younger adults, fragmentation of sleep can reduce SDC, and a normative increase in sleep fragmentation may account for reduced SDC with healthy aging. Whereas sleep disorders such as insomnia, obstructive sleep apnea, and narcolepsy can impair SDC in the absence of neurodegenerative changes, the incidence of sleep disorders increases both with normal aging and, further, with neurodegenerative disease. Specific features of sleep architecture, such as sleep spindles and SWS are strongly linked to SDC. Diminution of these features with healthy aging and their further decline with MCI may account for concomitant declines in SDC. Notably these same sleep features further markedly decline, in concert with declining cognitive function, with the progression to AD. Therefore, progressive changes in sleep quality, architecture, and neural regulation may constitute a contributing factor to cognitive decline that is seen both with healthy aging and, to a much greater extent, with neurodegenerative disease.
NASA Technical Reports Server (NTRS)
Chien, E. S. K.; Marinho, J. A.; Russell, J. E., Sr.
1988-01-01
The Cellular Access Digital Network (CADN) is the access vehicle through which cellular technology is brought into the mainstream of the evolving integrated telecommunications network. Beyond the integrated end-to-end digital access and per call network services provisioning of the Integrated Services Digital Network (ISDN), the CADN engenders the added capability of mobility freedom via wireless access. One key element of the CADN network architecture is the standard user to network interface that is independent of RF transmission technology. Since the Mobile Satellite System (MSS) is envisioned to not only complement but also enhance the capabilities of the terrestrial cellular telecommunications network, compatibility and interoperability between terrestrial cellular and mobile satellite systems are vitally important to provide an integrated moving telecommunications network of the future. From a network standpoint, there exist very strong commonalities between the terrestrial cellular system and the mobile satellite system. Therefore, the MSS architecture should be designed as an integral part of the CADN. This paper describes the concept of the CADN, the functional architecture of the MSS, and the user-network interface signaling protocols.
Accessibility Limits Recall from Visual Working Memory
ERIC Educational Resources Information Center
Rajsic, Jason; Swan, Garrett; Wilson, Daryl E.; Pratt, Jay
2017-01-01
In this article, we demonstrate limitations of accessibility of information in visual working memory (VWM). Recently, cued-recall has been used to estimate the fidelity of information in VWM, where the feature of a cued object is reproduced from memory (Bays, Catalao, & Husain, 2009; Wilken & Ma, 2004; Zhang & Luck, 2008). Response…
Physical principles and current status of emerging non-volatile solid state memories
NASA Astrophysics Data System (ADS)
Wang, L.; Yang, C.-H.; Wen, J.
2015-07-01
Today the influence of non-volatile solid-state memories on persons' lives has become more prominent because of their non-volatility, low data latency, and high robustness. As a pioneering technology that is representative of non-volatile solidstate memories, flash memory has recently seen widespread application in many areas ranging from electronic appliances, such as cell phones and digital cameras, to external storage devices such as universal serial bus (USB) memory. Moreover, owing to its large storage capacity, it is expected that in the near future, flash memory will replace hard-disk drives as a dominant technology in the mass storage market, especially because of recently emerging solid-state drives. However, the rapid growth of the global digital data has led to the need for flash memories to have larger storage capacity, thus requiring a further downscaling of the cell size. Such a miniaturization is expected to be extremely difficult because of the well-known scaling limit of flash memories. It is therefore necessary to either explore innovative technologies that can extend the areal density of flash memories beyond the scaling limits, or to vigorously develop alternative non-volatile solid-state memories including ferroelectric random-access memory, magnetoresistive random-access memory, phase-change random-access memory, and resistive random-access memory. In this paper, we review the physical principles of flash memories and their technical challenges that affect our ability to enhance the storage capacity. We then present a detailed discussion of novel technologies that can extend the storage density of flash memories beyond the commonly accepted limits. In each case, we subsequently discuss the physical principles of these new types of non-volatile solid-state memories as well as their respective merits and weakness when utilized for data storage applications. Finally, we predict the future prospects for the aforementioned solid-state memories for the next generation of data-storage devices based on a comparison of their performance. [Figure not available: see fulltext.
NASA Astrophysics Data System (ADS)
Han, Yishi; Luo, Zhixiao; Wang, Jianhua; Min, Zhixuan; Qin, Xinyu; Sun, Yunlong
2014-09-01
In general, context-based adaptive variable length coding (CAVLC) decoding in H.264/AVC standard requires frequent access to the unstructured variable length coding tables (VLCTs) and significant memory accesses are consumed. Heavy memory accesses will cause high power consumption and time delays, which are serious problems for applications in portable multimedia devices. We propose a method for high-efficiency CAVLC decoding by using a program instead of all the VLCTs. The decoded codeword from VLCTs can be obtained without any table look-up and memory access. The experimental results show that the proposed algorithm achieves 100% memory access saving and 40% decoding time saving without degrading video quality. Additionally, the proposed algorithm shows a better performance compared with conventional CAVLC decoding, such as table look-up by sequential search, table look-up by binary search, Moon's method, and Kim's method.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chin, George; Marquez, Andres; Choudhury, Sutanay
2012-09-01
Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis ofmore » large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.« less
NASA Astrophysics Data System (ADS)
Navlakha, Nupur; Kranti, Abhinav
2017-11-01
The work reports on the use of a planar tri-gate tunnel field effect transistor (TFET) to operate as dynamic memory at 85 °C with an enhanced sense margin (SM). Two symmetric gates (G1) aligned to the source at a partial region of intrinsic film result into better electrostatic control that regulates the read mechanism based on band-to-band tunneling, while the other gate (G2), positioned adjacent to the first front gate is responsible for charge storage and sustenance. The proposed architecture results in an enhanced SM of ˜1.2 μA μm-1 along with a longer retention time (RT) of ˜1.8 s at 85 °C, for a total length of 600 nm. The double gate architecture towards the source increases the tunneling current and also reduces short channel effects, enhancing SM and scalability, thereby overcoming the critical bottleneck faced by TFET based dynamic memories. The work also discusses the impact of overlap/underlap and interface charges on the performance of TFET based dynamic memory. Insights into device operation demonstrate that the choice of appropriate architecture and biases not only limit the trade-off between SM and RT, but also result in improved scalability with drain voltage and total length being scaled down to 0.8 V and 115 nm, respectively.
Critical Branches and Lucky Loads in Control-Independence Architectures
ERIC Educational Resources Information Center
Malik, Kshitiz
2009-01-01
Branch mispredicts have a first-order impact on the performance of integer applications. Control Independence (CI) architectures aim to overlap the penalties of mispredicted branches with useful execution by spawning control-independent work as separate threads. Although control independent, such threads may consume register and memory values…
Remote hardware-reconfigurable robotic camera
NASA Astrophysics Data System (ADS)
Arias-Estrada, Miguel; Torres-Huitzil, Cesar; Maya-Rueda, Selene E.
2001-10-01
In this work, a camera with integrated image processing capabilities is discussed. The camera is based on an imager coupled to an FPGA device (Field Programmable Gate Array) which contains an architecture for real-time computer vision low-level processing. The architecture can be reprogrammed remotely for application specific purposes. The system is intended for rapid modification and adaptation for inspection and recognition applications, with the flexibility of hardware and software reprogrammability. FPGA reconfiguration allows the same ease of upgrade in hardware as a software upgrade process. The camera is composed of a digital imager coupled to an FPGA device, two memory banks, and a microcontroller. The microcontroller is used for communication tasks and FPGA programming. The system implements a software architecture to handle multiple FPGA architectures in the device, and the possibility to download a software/hardware object from the host computer into its internal context memory. System advantages are: small size, low power consumption, and a library of hardware/software functionalities that can be exchanged during run time. The system has been validated with an edge detection and a motion processing architecture, which will be presented in the paper. Applications targeted are in robotics, mobile robotics, and vision based quality control.
Test and Evaluation of Architecture-Aware Compiler Environment
2011-11-01
biology, medicine, social sciences , and security applications. Challenges include extremely large graphs (the Facebook friend network has over...Operations with Temporal Binning ....................................................................... 32 4.12 Memory behavior and Energy per...five challenge problems empirically, exploring their scaling properties, computation and datatype needs, memory behavior , and temporal behavior
Memory Abilities in Williams Syndrome: Dissociation or Developmental Delay Hypothesis?
ERIC Educational Resources Information Center
Sampaio, Adriana; Sousa, Nuno; Fernandez, Montse; Henriques, Margarida; Goncalves, Oscar F.
2008-01-01
Williams syndrome (WS) is a neurodevelopmental genetic disorder often described as being characterized by a dissociative cognitive architecture, in which profound impairments of visuo-spatial cognition contrast with relative preservation of linguistic, face recognition and auditory short-memory abilities. This asymmetric and dissociative cognition…
SALT: The Simulator for the Analysis of LWP Timing
NASA Technical Reports Server (NTRS)
Springer, Paul L.; Rodrigues, Arun; Brockman, Jay
2006-01-01
With the emergence of new processor architectures that are highly multithreaded, and support features such as full/empty memory semantics and split-phase memory transactions, the need for a processor simulator to handle these features becomes apparent. This paper describes such a simulator, called SALT.
Parallel reduced-instruction-set-computer architecture for real-time symbolic pattern matching
NASA Astrophysics Data System (ADS)
Parson, Dale E.
1991-03-01
This report discusses ongoing work on a parallel reduced-instruction- set-computer (RISC) architecture for automatic production matching. The PRIOPS compiler takes advantage of the memoryless character of automatic processing by translating a program's collection of automatic production tests into an equivalent combinational circuit-a digital circuit without memory, whose outputs are immediate functions of its inputs. The circuit provides a highly parallel, fine-grain model of automatic matching. The compiler then maps the combinational circuit onto RISC hardware. The heart of the processor is an array of comparators capable of testing production conditions in parallel, Each comparator attaches to private memory that contains virtual circuit nodes-records of the current state of nodes and busses in the combinational circuit. All comparator memories hold identical information, allowing simultaneous update for a single changing circuit node and simultaneous retrieval of different circuit nodes by different comparators. Along with the comparator-based logic unit is a sequencer that determines the current combination of production-derived comparisons to try, based on the combined success and failure of previous combinations of comparisons. The memoryless nature of automatic matching allows the compiler to designate invariant memory addresses for virtual circuit nodes, and to generate the most effective sequences of comparison test combinations. The result is maximal utilization of parallel hardware, indicating speed increases and scalability beyond that found for course-grain, multiprocessor approaches to concurrent Rete matching. Future work will consider application of this RISC architecture to the standard (controlled) Rete algorithm, where search through memory dominates portions of matching.
Persistent Memory in Single Node Delay-Coupled Reservoir Computing.
Kovac, André David; Koall, Maximilian; Pipa, Gordon; Toutounji, Hazem
2016-01-01
Delays are ubiquitous in biological systems, ranging from genetic regulatory networks and synaptic conductances, to predator/pray population interactions. The evidence is mounting, not only to the presence of delays as physical constraints in signal propagation speed, but also to their functional role in providing dynamical diversity to the systems that comprise them. The latter observation in biological systems inspired the recent development of a computational architecture that harnesses this dynamical diversity, by delay-coupling a single nonlinear element to itself. This architecture is a particular realization of Reservoir Computing, where stimuli are injected into the system in time rather than in space as is the case with classical recurrent neural network realizations. This architecture also exhibits an internal memory which fades in time, an important prerequisite to the functioning of any reservoir computing device. However, fading memory is also a limitation to any computation that requires persistent storage. In order to overcome this limitation, the current work introduces an extended version to the single node Delay-Coupled Reservoir, that is based on trained linear feedback. We show by numerical simulations that adding task-specific linear feedback to the single node Delay-Coupled Reservoir extends the class of solvable tasks to those that require nonfading memory. We demonstrate, through several case studies, the ability of the extended system to carry out complex nonlinear computations that depend on past information, whereas the computational power of the system with fading memory alone quickly deteriorates. Our findings provide the theoretical basis for future physical realizations of a biologically-inspired ultrafast computing device with extended functionality.
Persistent Memory in Single Node Delay-Coupled Reservoir Computing
Pipa, Gordon; Toutounji, Hazem
2016-01-01
Delays are ubiquitous in biological systems, ranging from genetic regulatory networks and synaptic conductances, to predator/pray population interactions. The evidence is mounting, not only to the presence of delays as physical constraints in signal propagation speed, but also to their functional role in providing dynamical diversity to the systems that comprise them. The latter observation in biological systems inspired the recent development of a computational architecture that harnesses this dynamical diversity, by delay-coupling a single nonlinear element to itself. This architecture is a particular realization of Reservoir Computing, where stimuli are injected into the system in time rather than in space as is the case with classical recurrent neural network realizations. This architecture also exhibits an internal memory which fades in time, an important prerequisite to the functioning of any reservoir computing device. However, fading memory is also a limitation to any computation that requires persistent storage. In order to overcome this limitation, the current work introduces an extended version to the single node Delay-Coupled Reservoir, that is based on trained linear feedback. We show by numerical simulations that adding task-specific linear feedback to the single node Delay-Coupled Reservoir extends the class of solvable tasks to those that require nonfading memory. We demonstrate, through several case studies, the ability of the extended system to carry out complex nonlinear computations that depend on past information, whereas the computational power of the system with fading memory alone quickly deteriorates. Our findings provide the theoretical basis for future physical realizations of a biologically-inspired ultrafast computing device with extended functionality. PMID:27783690
Bäuml, Karl-Heinz T; Dobler, Ina M
2015-01-01
Depending on the degree to which the original study context is accessible, selective memory retrieval can be detrimental or beneficial for the recall of other memories (Bäuml & Samenieh, 2012). Prior work has shown that the detrimental effect of memory retrieval is typically recall specific and does not arise after restudy trials, whereas recall specificity of the beneficial effect has not been examined to date. Addressing the issue, we compared in 2 experiments the effects of retrieval and restudy on recall of other items, when access to the study context was (largely) maintained and when access to the study context was impaired (in Experiment 1 by using the listwise directed-forgetting task, in Experiment 2 by using a prolonged retention interval). In both experiments, selective retrieval but not restudy induced forgetting of other items when context access was maintained, which replicates prior work. In contrast, when context access was impaired, both selective retrieval and restudy induced beneficial effects on other memories. These findings suggest that the detrimental but not the beneficial effect of selective memory retrieval is recall specific. The results are consistent with a recent 2-factor account of selective memory retrieval that attributes the detrimental effect to inhibition or blocking but the beneficial effect to context reactivation processes. PsycINFO Database Record (c) 2015 APA, all rights reserved.
Low latency memory access and synchronization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processormore » only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.« less
Low latency memory access and synchronization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Bach processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processormore » only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.« less
Federal Register 2010, 2011, 2012, 2013, 2014
2012-08-20
... ARCHITECTURAL AND TRANSPORTATION BARRIERS COMPLIANCE BOARD 36 CFR Part 1192 [Docket No. ATBCB 2010-0004] RIN 3014-AA38 Americans With Disabilities Act (ADA) Accessibility Guidelines for Transportation Vehicles AGENCY: Architectural and Transportation Barriers Compliance Board. ACTION: Notice of public...
Design of a system based on DSP and FPGA for video recording and replaying
NASA Astrophysics Data System (ADS)
Kang, Yan; Wang, Heng
2013-08-01
This paper brings forward a video recording and replaying system with the architecture of Digital Signal Processor (DSP) and Field Programmable Gate Array (FPGA). The system achieved encoding, recording, decoding and replaying of Video Graphics Array (VGA) signals which are displayed on a monitor during airplanes and ships' navigating. In the architecture, the DSP is a main processor which is used for a large amount of complicated calculation during digital signal processing. The FPGA is a coprocessor for preprocessing video signals and implementing logic control in the system. In the hardware design of the system, Peripheral Device Transfer (PDT) function of the External Memory Interface (EMIF) is utilized to implement seamless interface among the DSP, the synchronous dynamic RAM (SDRAM) and the First-In-First-Out (FIFO) in the system. This transfer mode can avoid the bottle-neck of the data transfer and simplify the circuit between the DSP and its peripheral chips. The DSP's EMIF and two level matching chips are used to implement Advanced Technology Attachment (ATA) protocol on physical layer of the interface of an Integrated Drive Electronics (IDE) Hard Disk (HD), which has a high speed in data access and does not rely on a computer. Main functions of the logic on the FPGA are described and the screenshots of the behavioral simulation are provided in this paper. In the design of program on the DSP, Enhanced Direct Memory Access (EDMA) channels are used to transfer data between the FIFO and the SDRAM to exert the CPU's high performance on computing without intervention by the CPU and save its time spending. JPEG2000 is implemented to obtain high fidelity in video recording and replaying. Ways and means of acquiring high performance for code are briefly present. The ability of data processing of the system is desirable. And smoothness of the replayed video is acceptable. By right of its design flexibility and reliable operation, the system based on DSP and FPGA for video recording and replaying has a considerable perspective in analysis after the event, simulated exercitation and so forth.
Plated wire random access memories
NASA Technical Reports Server (NTRS)
Gouldin, L. D.
1975-01-01
A program was conducted to construct 4096-work by 18-bit random access, NDRO-plated wire memory units. The memory units were subjected to comprehensive functional and environmental tests at the end-item level to verify comformance with the specified requirements. A technical description of the unit is given, along with acceptance test data sheets.
The Dynamics of Access to Groups in Working Memory
ERIC Educational Resources Information Center
Farrell, Simon; Lelievre, Anna
2012-01-01
The finding that participants leave a pause between groups when attempting serial recall of temporally grouped lists has been taken to indicate access to a hierarchical representation of the list in working memory. An alternative explanation is that the dynamics of serial recall solely reflect output (rather than memorial) processes, with the…
Nanometric summation architecture based on optical near-field interaction between quantum dots.
Naruse, Makoto; Miyazaki, Tetsuya; Kubota, Fumito; Kawazoe, Tadashi; Kobayashi, Kiyoshi; Sangu, Suguru; Ohtsu, Motoichi
2005-01-15
A nanoscale data summation architecture is proposed and experimentally demonstrated based on the optical near-field interaction between quantum dots. Based on local electromagnetic interactions between a few nanometric elements via optical near fields, we can combine multiple excitations at a certain quantum dot, which allows construction of a summation architecture. Summation plays a key role for content-addressable memory, which is one of the most important functions in optical networks.
Performance of FORTRAN floating-point operations on the Flex/32 multicomputer
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1987-01-01
A series of experiments has been run to examine the floating-point performance of FORTRAN programs on the Flex/32 (Trademark) computer. The experiments are described, and the timing results are presented. The time required to execute a floating-point operation is found to vary considerbaly depending on a number of factors. One factor of particular interest from an algorithm design standpoint is the difference in speed between common memory accesses and local memory accesses. Common memory accesses were found to be slower, and guidelines are given for determinig when it may be cost effective to copy data from common to local memory.
High speed optical object recognition processor with massive holographic memory
NASA Technical Reports Server (NTRS)
Chao, T.; Zhou, H.; Reyes, G.
2002-01-01
Real-time object recognition using a compact grayscale optical correlator will be introduced. A holographic memory module for storing a large bank of optimum correlation filters, to accommodate the large data throughput rate needed for many real-world applications, has also been developed. System architecture of the optical processor and the holographic memory will be presented. Application examples of this object recognition technology will also be demonstrated.
Expert system shell to reason on large amounts of data
NASA Technical Reports Server (NTRS)
Giuffrida, Gionanni
1994-01-01
The current data base management systems (DBMS's) do not provide a sophisticated environment to develop rule based expert systems applications. Some of the new DBMS's come with some sort of rule mechanism; these are active and deductive database systems. However, both of these are not featured enough to support full implementation based on rules. On the other hand, current expert system shells do not provide any link with external databases. That is, all the data are kept in the system working memory. Such working memory is maintained in main memory. For some applications the reduced size of the available working memory could represent a constraint for the development. Typically these are applications which require reasoning on huge amounts of data. All these data do not fit into the computer main memory. Moreover, in some cases these data can be already available in some database systems and continuously updated while the expert system is running. This paper proposes an architecture which employs knowledge discovering techniques to reduce the amount of data to be stored in the main memory; in this architecture a standard DBMS is coupled with a rule-based language. The data are stored into the DBMS. An interface between the two systems is responsible for inducing knowledge from the set of relations. Such induced knowledge is then transferred to the rule-based language working memory.
NASA Astrophysics Data System (ADS)
Haron, Adib; Mahdzair, Fazren; Luqman, Anas; Osman, Nazmie; Junid, Syed Abdul Mutalib Al
2018-03-01
One of the most significant constraints of Von Neumann architecture is the limited bandwidth between memory and processor. The cost to move data back and forth between memory and processor is considerably higher than the computation in the processor itself. This architecture significantly impacts the Big Data and data-intensive application such as DNA analysis comparison which spend most of the processing time to move data. Recently, the in-memory processing concept was proposed, which is based on the capability to perform the logic operation on the physical memory structure using a crossbar topology and non-volatile resistive-switching memristor technology. This paper proposes a scheme to map digital equality comparator circuit on memristive memory crossbar array. The 2-bit, 4-bit, 8-bit, 16-bit, 32-bit, and 64-bit of equality comparator circuit are mapped on memristive memory crossbar array by using material implication logic in a sequential and parallel method. The simulation results show that, for the 64-bit word size, the parallel mapping exhibits 2.8× better performance in total execution time than sequential mapping but has a trade-off in terms of energy consumption and area utilization. Meanwhile, the total crossbar area can be reduced by 1.2× for sequential mapping and 1.5× for parallel mapping both by using the overlapping technique.
NASA Technical Reports Server (NTRS)
2013-01-01
Topics covered include: Radial Internal Material Handling System (RIMS) for Circular Habitat Volumes; Conical Seat Shut-Off Valve; Impact-Actuated Digging Tool for Lunar Excavation; Flexible Mechanical Conveyors for Regolith Extraction and Transport; Remote Memory Access Protocol Target Node Intellectual Property; Soft Decision Analyzer; Distributed Prognostics and Health Management with a Wireless Network Architecture; Minimal Power Latch for Single-Slope ADCs; Bismuth Passivation Technique for High-Resolution X-Ray Detectors; High-Strength, Super-elastic Compounds; Cu-Cr-Nb-Zr Alloy for Rocket Engines and Other High-Heat- Flux Applications; Microgravity Storage Vessels and Conveying-Line Feeders for Cohesive Regolith; CRUQS: A Miniature Fine Sun Sensor for Nanosatellites; On-Chip Microfluidic Components for In Situ Analysis, Separation, and Detection of Amino Acids; Spectroscopic Determination of Trace Contaminants in High-Purity Oxygen; Method of Separating Oxygen From Spacecraft Cabin Air to Enable Extravehicular Activities; Atomic Force Microscope Mediated Chromatography; Sample Analysis at Mars Instrument Simulator; Access Control of Web- and Java-Based Applications; Tool for Automated Retrieval of Generic Event Tracks (TARGET); Bilayer Protograph Codes for Half-Duplex Relay Channels; Influence of Computational Drop Representation in LES of a Droplet-Laden Mixing Layer.
gpuPOM: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Zhang, Y.; Fu, H.; Oey, L.-Y.; Xu, F.; Yang, G.
2014-11-01
Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration works for climate models are doing partial code porting for certain hot spots, and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point, design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of the state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We take several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the hybrid Central Processing Unit (CPU) and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
Multiscale Interactive Communication: Inside and Outside Thun Castle
NASA Astrophysics Data System (ADS)
Massari, G. A.; Luce, F.; Pellegatta, C.
2011-09-01
The applications of informatics to architecture have become, for professionals, a great tool for managing analytical phases and project activities but also, for the general public, new ways of communication that may relate directly present, past and future facts. Museums in historic buildings, their installations and the recent experiences of eco-museums located throughout the territory provide a privileged experimentation field for technical and digital representation. On the one hand, the safeguarding and the functional adaptation of buildings use 3D computer graphics models that are real spatially related databases: in them are ordered, viewed and interpreted the results of archival, artistic-historical, diagnostic, technological-structural studies and the assumption and feasibility of interventions. On the other hand, the disclosure of things and knowledge linked to collective memory relies on interactive maps and hypertext systems that provide access to authentic virtual museums; a sort of multimedia extension of the exhibition hall is produced to an architectural scale, but at landscape scale the result is an instrument of cultural development so far unpublished: works that are separated in direct perception find in a zenith view of the map a synthetic relation, related both to spatial parameters and temporal interpretations.
Optimization of a Lattice Boltzmann Computation on State-of-the-Art Multicore Platforms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Williams, Samuel; Carter, Jonathan; Oliker, Leonid
2009-04-10
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well asmore » a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15x improvement compared with the original code at a given concurrency. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.« less
Nguyen, Tuan-Anh; Nakib, Amir; Nguyen, Huy-Nam
2016-06-01
The Non-local means denoising filter has been established as gold standard for image denoising problem in general and particularly in medical imaging due to its efficiency. However, its computation time limited its applications in real world application, especially in medical imaging. In this paper, a distributed version on parallel hybrid architecture is proposed to solve the computation time problem and a new method to compute the filters' coefficients is also proposed, where we focused on the implementation and the enhancement of filters' parameters via taking the neighborhood of the current voxel more accurately into account. In terms of implementation, our key contribution consists in reducing the number of shared memory accesses. The different tests of the proposed method were performed on the brain-web database for different levels of noise. Performances and the sensitivity were quantified in terms of speedup, peak signal to noise ratio, execution time, the number of floating point operations. The obtained results demonstrate the efficiency of the proposed method. Moreover, the implementation is compared to that of other techniques, recently published in the literature. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Does Tracing Worked Examples Enhance Geometry Learning?
ERIC Educational Resources Information Center
Hu, Fang-Tzu; Ginns, Paul; Bobis, Janette
2014-01-01
Cognitive load theory seeks to generate novel instructional designs through a focus on human cognitive architecture including a limited working memory; however, the potential for enhancing learning through non-visual or non-auditory working memory channels is yet to be evaluated. This exploratory experiment tested whether explicit instructions to…
Improving the effectiveness of an interruption lag by inducing a memory-based strategy.
Morgan, Phillip L; Patrick, John; Tiley, Leyanne
2013-01-01
The memory for goals model (Altmann & Trafton, 2002) posits the importance of a short delay (the 'interruption lag') before an interrupting task to encode suspended goals for retrieval post-interruption. Two experiments used the theory of soft constraints (Gray, Simms, Fu & Schoelles, 2006) to investigate whether the efficacy of an interruption lag could be improved by increasing goal-state access cost to induce a more memory-based encoding strategy. Both experiments used a copying task with three access cost conditions (Low, Medium, and High) and a 5-s interruption lag with a no lag control condition. Experiment 1 found that the participants in the High access cost condition resumed more interrupted trials and executed more actions correctly from memory when coupled with an interruption lag. Experiment 2 used a prospective memory test post-interruption and an eyetracker recorded gaze activity during the interruption lag. The participants in the High access cost condition with an interruption lag were best at encoding target information during the interruption lag, evidenced by higher scores on the prospective memory measure and more gaze activity on the goal-state during the interruption lag. Theoretical and practical issues regarding the use of goal-state access cost and an interruption lag are discussed. Copyright © 2012. Published by Elsevier B.V.
ARCHITECTURAL FLOOR PLAN OF PROCESS AND ACCESS AREAS HOT PILOT ...
ARCHITECTURAL FLOOR PLAN OF PROCESS AND ACCESS AREAS HOT PILOT PLANT (CPP-640). INL DRAWING NUMBER 200-0640-00-279-111679. ALTERNATE ID NUMBER 8952-CPP-640-A-2. - Idaho National Engineering Laboratory, Idaho Chemical Processing Plant, Fuel Reprocessing Complex, Scoville, Butte County, ID
Bilinearity, Rules, and Prefrontal Cortex
Dayan, Peter
2007-01-01
Humans can be instructed verbally to perform computationally complex cognitive tasks; their performance then improves relatively slowly over the course of practice. Many skills underlie these abilities; in this paper, we focus on the particular question of a uniform architecture for the instantiation of habitual performance and the storage, recall, and execution of simple rules. Our account builds on models of gated working memory, and involves a bilinear architecture for representing conditional input-output maps and for matching rules to the state of the input and working memory. We demonstrate the performance of our model on two paradigmatic tasks used to investigate prefrontal and basal ganglia function. PMID:18946523
Programmable Direct-Memory-Access Controller
NASA Technical Reports Server (NTRS)
Hendry, David F.
1990-01-01
Proposed programmable direct-memory-access controller (DMAC) operates with computer systems of 32000 series, which have 32-bit data buses and use addresses of 24 (or potentially 32) bits. Controller functions with or without help of central processing unit (CPU) and starts itself. Includes such advanced features as ability to compare two blocks of memory for equality and to search block of memory for specific value. Made as single very-large-scale integrated-circuit chip.
Flexible Endian Adjustment for Cross Architecture Binary Translation
NASA Astrophysics Data System (ADS)
Zhu, Tong; Liu, Bo; Guan, Haibing; Liang, Alei
Different architectures and/or ISA (Instruction Set Architecture) representations hold different data arranging formats in the memory. Therefore, the adjustment of byte packing order (endianness) is indispensable in cross- architecture binary translation if the source and target machines are of heterogeneous endianness, which may otherwise cause system failure. The issue is inconspicuous but may lead to significant performance bottleneck. This paper investigates the key aspects of endianness and finds several solutions to endian adjustment for cross-architecture binary translation. In particular, it considers the two principal methods of this field - byte swapping and address swizzling, and gives a comparison of them in our DBT (Dynamic Binary Translator) - CrossBit.
Local wavelet transform: a cost-efficient custom processor for space image compression
NASA Astrophysics Data System (ADS)
Masschelein, Bart; Bormans, Jan G.; Lafruit, Gauthier
2002-11-01
Thanks to its intrinsic scalability features, the wavelet transform has become increasingly popular as decorrelator in image compression applications. Throuhgput, memory requirements and complexity are important parameters when developing hardware image compression modules. An implementation of the classical, global wavelet transform requires large memory sizes and implies a large latency between the availability of the input image and the production of minimal data entities for entropy coding. Image tiling methods, as proposed by JPEG2000, reduce the memory sizes and the latency, but inevitably introduce image artefacts. The Local Wavelet Transform (LWT), presented in this paper, is a low-complexity wavelet transform architecture using a block-based processing that results in the same transformed images as those obtained by the global wavelet transform. The architecture minimizes the processing latency with a limited amount of memory. Moreover, as the LWT is an instruction-based custom processor, it can be programmed for specific tasks, such as push-broom processing of infinite-length satelite images. The features of the LWT makes it appropriate for use in space image compression, where high throughput, low memory sizes, low complexity, low power and push-broom processing are important requirements.
Gilgamesh: A Multithreaded Processor-In-Memory Architecture for Petaflops Computing
NASA Technical Reports Server (NTRS)
Sterling, T. L.; Zima, H. P.
2002-01-01
Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commodity clusters. In this paper we describe the design of Gilgamesh, a PIM-based massively parallel architecture, and elements of its execution model. Gilgamesh extends existing PIM capabilities by incorporating advanced mechanisms for virtualizing tasks and data and providing adaptive resource management for load balancing and latency tolerance. The Gilgamesh execution model is based on macroservers, a middleware layer which supports object-based runtime management of data and threads allowing explicit and dynamic control of locality and load balancing. The paper concludes with a discussion of related research activities and an outlook to future work.
NASA Astrophysics Data System (ADS)
Breskovic, Damir; Sikirica, Mladen; Begusic, Dinko
2018-05-01
This paper gives an overview and background of optical access network deployment in Croatia. Optical access network development in Croatia has been put into a global as well as in the European Union context. All the challenges and the driving factors for optical access networks deployment are considered. Optical access network architectures that have been deployed by most of the investors in Croatian telecommunication market are presented, as well as the architectures that are in early phase of deployment. Finally, an overview on current status of mobile networks of the fifth generation and Internet of Things is given.
The Architecture, Dynamics, and Development of Mental Processing: Greek, Chinese, or Universal?
ERIC Educational Resources Information Center
Demetriou, A.; Kui, Z.X.; Spanoudis, G.; Christou, C.; Kyriakides, L.; Platsidou, M.
2005-01-01
This study compared Greeks with Chinese, from 8 to 14 years of age, on measures of processing efficiency, working memory, and reasoning. All processes were addressed through three domains of relations: verbal/propositional, quantitative, and visuo/spatial. Structural equations modelling and rating scale analysis showed that the architecture and…
Midcentury Modern High Schools: Rebooting the Architecture
ERIC Educational Resources Information Center
Havens, Kevin
2010-01-01
A high school is more than a building; it's a repository of memories for many community members. High schools built at the turn of the century are not only cultural and civic landmarks, they are also often architectural treasures. When these facilities become outdated, a renovation that preserves the building's aesthetics and character is usually…
Architectural Techniques For Managing Non-volatile Caches
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh
As chip power dissipation becomes a critical challenge in scaling processor performance, computer architects are forced to fundamentally rethink the design of modern processors and hence, the chip-design industry is now at a major inflection point in its hardware roadmap. The high leakage power and low density of SRAM poses serious obstacles in its use for designing large on-chip caches and for this reason, researchers are exploring non-volatile memory (NVM) devices, such as spin torque transfer RAM, phase change RAM and resistive RAM. However, since NVMs are not strictly superior to SRAM, effective architectural techniques are required for making themmore » a universal memory solution. This book discusses techniques for designing processor caches using NVM devices. It presents algorithms and architectures for improving their energy efficiency, performance and lifetime. It also provides both qualitative and quantitative evaluation to help the reader gain insights and motivate them to explore further. This book will be highly useful for beginners as well as veterans in computer architecture, chip designers, product managers and technical marketing professionals.« less
MYSEA: The Monterey Security Architecture
2009-01-01
Security and Protection, Organization and Design General Terms: Design; Security Keywords: access controls, authentication, information flow controls...Applicable environments include: mil- itary coalitions, agencies and organizations responding to security emergencies, and mandated sharing in business ...network architecture affords users the abil- ity to securely access information across networks at dif- ferent classifications using standardized
Do Performance-Based Codes Support Universal Design in Architecture?
Grangaard, Sidse; Frandsen, Anne Kathrine
2016-01-01
The research project 'An analysis of the accessibility requirements' studies how Danish architectural firms experience the accessibility requirements of the Danish Building Regulations and it examines their opinions on how future regulative models can support innovative and inclusive design - Universal Design (UD). The empirical material consists of input from six workshops to which all 700 Danish Architectural firms were invited, as well as eight group interviews. The analysis shows that the current prescriptive requirements are criticized for being too homogenous and possibilities for differentiation and zoning are required. Therefore, a majority of professionals are interested in a performance-based model because they think that such a model will support 'accessibility zoning', achieving flexibility because of different levels of accessibility in a building due to its performance. The common understanding of accessibility and UD is directly related to buildings like hospitals and care centers. When the objective is both innovative and inclusive architecture, the request of a performance-based model should be followed up by a knowledge enhancement effort in the building sector. Bloom's taxonomy of educational objectives is suggested as a tool for such a boost. The research project has been financed by the Danish Transport and Construction Agency.
Modeling aspects of human memory for scientific study.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Caudell, Thomas P.; Watson, Patrick; McDaniel, Mark A.
Working with leading experts in the field of cognitive neuroscience and computational intelligence, SNL has developed a computational architecture that represents neurocognitive mechanisms associated with how humans remember experiences in their past. The architecture represents how knowledge is organized and updated through information from individual experiences (episodes) via the cortical-hippocampal declarative memory system. We compared the simulated behavioral characteristics with those of humans measured under well established experimental standards, controlling for unmodeled aspects of human processing, such as perception. We used this knowledge to create robust simulations of & human memory behaviors that should help move the scientific community closermore » to understanding how humans remember information. These behaviors were experimentally validated against actual human subjects, which was published. An important outcome of the validation process will be the joining of specific experimental testing procedures from the field of neuroscience with computational representations from the field of cognitive modeling and simulation.« less
Some pitfalls in measuring memory in animals.
Thorpe, Christina M; Jacova, Claudia; Wilkie, Donald M
2004-11-01
Because the presence or absence of memories in the brain cannot be directly observed, scientists must rely on indirect measures and use inferential reasoning to make statements about the status of memories. In humans, memories are often accessed through spoken or written language. In animals, memory is accessed through overt behaviours such as running down an arm in a maze, pressing a lever, or visiting a food cache site. Because memory is measured by these indirect methods, errors in the veracity of statements about memory can occur. In this brief paper, we identify three areas that may serve as pitfalls in reasoning about memory in animals: (1) the presence of 'silent associations', (2) intrusions of species-typical behaviours on memory tasks, and (3) improper mapping between human and animals memory tasks. There are undoubtedly other areas in which scientists should act cautiously when reasoning about the status of memory.
NASA Technical Reports Server (NTRS)
Ivancic, William D.
2003-01-01
Traditional NASA missions, both near Earth and deep space, have been stovepipe in nature and point-to-point in architecture. Recently, NASA and others have conceptualized missions that required space-based networking. The notion of networks in space is a drastic shift in thinking and requires entirely new architectures, radio systems (antennas, modems, and media access), and possibly even new protocols. A full system engineering approach for some key mission architectures will occur that considers issues such as the science being performed, stationkeeping, antenna size, contact time, data rates, radio-link power requirements, media access techniques, and appropriate networking and transport protocols. This report highlights preliminary architecture concepts and key technologies that will be investigated.
Sawmill: A Logging File System for a High-Performance RAID Disk Array
1995-01-01
from limiting disk performance, new controller architectures connect the disks directly to the network so that data movement bypasses the file server...These developments raise two questions for file systems: how to get the best performance from a RAID, and how to use such a controller architecture ...the RAID-II storage system; this architecture provides a fast data path that moves data rapidly among the disks, high-speed controller memory, and the
A Decision Model for Selection of Microcomputers and Operating Systems.
1984-06-01
is resilting in application software (for microccmputers) being developed almost exclu- sively tor the IBM PC and compatiole systems. NAVDAC ielt that...location can be indepen- dently accessed. RAN memory is also often called read/ write memory, hecause new information can be written into and read from...when power is lost; this is also read/ write memory. Bubble memory, however, has significantly slower access times than RAM or RON and also is not preva
Optimizing Irregular Applications for Energy and Performance on the Tilera Many-core Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chavarría-Miranda, Daniel; Panyala, Ajay R.; Halappanavar, Mahantesh
Optimizing applications simultaneously for energy and performance is a complex problem. High performance, parallel, irregular applications are notoriously hard to optimize due to their data-dependent memory accesses, lack of structured locality and complex data structures and code patterns. Irregular kernels are growing in importance in applications such as machine learning, graph analytics and combinatorial scientific computing. Performance- and energy-efficient implementation of these kernels on modern, energy efficient, multicore and many-core platforms is therefore an important and challenging problem. We present results from optimizing two irregular applications { the Louvain method for community detection (Grappolo), and high-performance conjugate gradient (HPCCG) {more » on the Tilera many-core system. We have significantly extended MIT's OpenTuner auto-tuning framework to conduct a detailed study of platform-independent and platform-specific optimizations to improve performance as well as reduce total energy consumption. We explore the optimization design space along three dimensions: memory layout schemes, compiler-based code transformations, and optimization of parallel loop schedules. Using auto-tuning, we demonstrate whole node energy savings of up to 41% relative to a baseline instantiation, and up to 31% relative to manually optimized variants.« less
GPU implementation of the simplex identification via split augmented Lagrangian
NASA Astrophysics Data System (ADS)
Sevilla, Jorge; Nascimento, José M. P.
2015-10-01
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels, due to the low spatial resolution of such images. This means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient implementation of a unsupervised linear unmixing method on GPUs using CUDA. The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems using variable splitting to obtain a constraint formulation, and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at low level, using shared memory and coalesced accesses to memory. The results herein presented indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the methods accuracy.
Performance of the Magnetospheric Multiscale central instrument data handling
NASA Astrophysics Data System (ADS)
Klar, Robert A.; Miller, Scott A.; Brysch, Michael L.; Bertrand, Allison R.
In order to study the fundamental physical processes of magnetic reconnection, particle acceleration and turbulence, the Magnetospheric Multiscale (MMS) mission employs a constellation of four identically configured observatories, each with a suite of complementary science instruments. Southwest Research Institute® (SwRI® ) developed the Central Instrument Data Processor (CIDP) to handle the large data volume associated with these instruments. The CIDP is an integrated access point between the instruments and the spacecraft. It provides synchronization pulses, relays telecommands, and gathers instrument housekeeping telemetry. It collects science data from the instruments and stores it to a mass memory for later playback to a ground station. This paper retrospectively examines the data handling performance realized by the CIDP implementation. It elaborates on some of the constraints on the hardware and software designs and the resulting effects on performance. For the hardware, it discusses the limitations of the front-end electronics input/output (I/O) architecture and associated mass memory buffering. For the software, it discusses the limitations of the Consultative Committee for Space Data Systems (CCSDS) File Delivery Protocol (CFDP) implementation and the data structure choices for file management. It also describes design changes that improve data handling performance in newer designs.
Mniszewski, S M; Cawkwell, M J; Wall, M E; Mohd-Yusof, J; Bock, N; Germann, T C; Niklasson, A M N
2015-10-13
We present an algorithm for the calculation of the density matrix that for insulators scales linearly with system size and parallelizes efficiently on multicore, shared memory platforms with small and controllable numerical errors. The algorithm is based on an implementation of the second-order spectral projection (SP2) algorithm [ Niklasson, A. M. N. Phys. Rev. B 2002 , 66 , 155115 ] in sparse matrix algebra with the ELLPACK-R data format. We illustrate the performance of the algorithm within self-consistent tight binding theory by total energy calculations of gas phase poly(ethylene) molecules and periodic liquid water systems containing up to 15,000 atoms on up to 16 CPU cores. We consider algorithm-specific performance aspects, such as local vs nonlocal memory access and the degree of matrix sparsity. Comparisons to sparse matrix algebra implementations using off-the-shelf libraries on multicore CPUs, graphics processing units (GPUs), and the Intel many integrated core (MIC) architecture are also presented. The accuracy and stability of the algorithm are illustrated with long duration Born-Oppenheimer molecular dynamics simulations of 1000 water molecules and a 303 atom Trp cage protein solvated by 2682 water molecules.
Parallel design of JPEG-LS encoder on graphics processing units
NASA Astrophysics Data System (ADS)
Duan, Hao; Fang, Yong; Huang, Bormin
2012-01-01
With recent technical advances in graphic processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on a NVIDIA GPU using the computer unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.
MemAxes: Visualization and Analytics for Characterizing Complex Memory Performance Behaviors.
Gimenez, Alfredo; Gamblin, Todd; Jusufi, Ilir; Bhatele, Abhinav; Schulz, Martin; Bremer, Peer-Timo; Hamann, Bernd
2018-07-01
Memory performance is often a major bottleneck for high-performance computing (HPC) applications. Deepening memory hierarchies, complex memory management, and non-uniform access times have made memory performance behavior difficult to characterize, and users require novel, sophisticated tools to analyze and optimize this aspect of their codes. Existing tools target only specific factors of memory performance, such as hardware layout, allocations, or access instructions. However, today's tools do not suffice to characterize the complex relationships between these factors. Further, they require advanced expertise to be used effectively. We present MemAxes, a tool based on a novel approach for analytic-driven visualization of memory performance data. MemAxes uniquely allows users to analyze the different aspects related to memory performance by providing multiple visual contexts for a centralized dataset. We define mappings of sampled memory access data to new and existing visual metaphors, each of which enabling a user to perform different analysis tasks. We present methods to guide user interaction by scoring subsets of the data based on known performance problems. This scoring is used to provide visual cues and automatically extract clusters of interest. We designed MemAxes in collaboration with experts in HPC and demonstrate its effectiveness in case studies.
Integrated semiconductor-magnetic random access memory system
NASA Technical Reports Server (NTRS)
Katti, Romney R. (Inventor); Blaes, Brent R. (Inventor)
2001-01-01
The present disclosure describes a non-volatile magnetic random access memory (RAM) system having a semiconductor control circuit and a magnetic array element. The integrated magnetic RAM system uses CMOS control circuit to read and write data magnetoresistively. The system provides a fast access, non-volatile, radiation hard, high density RAM for high speed computing.
NASA Astrophysics Data System (ADS)
Hashimoto, Ryoji; Matsumura, Tomoya; Nozato, Yoshihiro; Watanabe, Kenji; Onoye, Takao
A multi-agent object attention system is proposed, which is based on biologically inspired attractor selection model. Object attention is facilitated by using a video sequence and a depth map obtained through a compound-eye image sensor TOMBO. Robustness of the multi-agent system over environmental changes is enhanced by utilizing the biological model of adaptive response by attractor selection. To implement the proposed system, an efficient VLSI architecture is employed with reducing enormous computational costs and memory accesses required for depth map processing and multi-agent attractor selection process. According to the FPGA implementation result of the proposed object attention system, which is accomplished by using 7,063 slices, 640×512 pixel input images can be processed in real-time with three agents at a rate of 9fps in 48MHz operation.
Unobtrusive Software and System Health Management with R2U2 on a Parallel MIMD Coprocessor
NASA Technical Reports Server (NTRS)
Schumann, Johann; Moosbrugger, Patrick
2017-01-01
Dynamic monitoring of software and system health of a complex cyber-physical system requires observers that continuously monitor variables of the embedded software in order to detect anomalies and reason about root causes. There exists a variety of techniques for code instrumentation, but instrumentation might change runtime behavior and could require costly software re-certification. In this paper, we present R2U2E, a novel realization of our real-time, Realizable, Responsive, and Unobtrusive Unit (R2U2). The R2U2E observers are executed in parallel on a dedicated 16-core EPIPHANY co-processor, thereby avoiding additional computational overhead to the system under observation. A DMA-based shared memory access architecture allows R2U2E to operate without any code instrumentation or program interference.