Efficient Hardware Implementation of the Lightweight Block Encryption Algorithm LEA
Lee, Donggeon; Kim, Dong-Chan; Kwon, Daesung; Kim, Howon
2014-01-01
Recently, with the proliferation of resource-constrained devices such as smartphones and smart devices, the computing environment is changing. Because our daily life is deeply intertwined with ubiquitous networks, the importance of security is growing. A lightweight encryption algorithm is essential for secure communication between these kinds of resource-constrained devices, and many researchers have been investigating this field. Recently, a lightweight block cipher called LEA was proposed. LEA was originally targeted at efficient implementation on microprocessors: it is fast when implemented in software and, furthermore, has a small memory footprint. To reflect recent processor technology, all required calculations use 32-bit wide operations. In addition, the algorithm is built not from complex S-box structures but from simple addition, rotation, and XOR (ARX) operations. To the best of our knowledge, this paper is the first report of a comprehensive hardware implementation of LEA. We present various hardware structures and their implementation results according to key sizes. Even though LEA was originally targeted at software efficiency, it also shows high efficiency when implemented in hardware. PMID:24406859
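Since the cipher is pure ARX, its datapath is easy to picture in a few lines. The sketch below shows one LEA encryption round over four 32-bit words, with the rotation amounts (9, 5, 3) taken from the published LEA specification; key scheduling and round count are omitted, so treat it as an illustration of the ARX structure rather than a complete implementation.

```python
MASK = 0xFFFFFFFF

def rol(x, r):  # 32-bit rotate left
    return ((x << r) | (x >> (32 - r))) & MASK

def ror(x, r):  # 32-bit rotate right
    return ((x >> r) | (x << (32 - r))) & MASK

def lea_round(x, rk):
    """One LEA round on four 32-bit words x[0..3] with six round-key
    words rk[0..5]; only additions, rotations and XORs are used."""
    x0 = rol(((x[0] ^ rk[0]) + (x[1] ^ rk[1])) & MASK, 9)
    x1 = ror(((x[1] ^ rk[2]) + (x[2] ^ rk[3])) & MASK, 5)
    x2 = ror(((x[2] ^ rk[4]) + (x[3] ^ rk[5])) & MASK, 3)
    return [x0, x1, x2, x[0]]
```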
Compiler-assisted multiple instruction rollback recovery using a read buffer
NASA Technical Reports Server (NTRS)
Alewine, N. J.; Chen, S.-K.; Fuchs, W. K.; Hwu, W.-M.
1993-01-01
Multiple instruction rollback (MIR) is a technique that has been implemented in mainframe computers to provide rapid recovery from transient processor failures. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs have also been developed which remove rollback data hazards directly with data-flow transformations. This paper focuses on compiler-assisted techniques to achieve multiple instruction rollback recovery. We observe that some data hazards resulting from instruction rollback can be resolved efficiently by providing an operand read buffer while others are resolved more efficiently with compiler transformations. A compiler-assisted multiple instruction rollback scheme is developed which combines hardware-implemented data redundancy with compiler-driven hazard removal transformations. Experimental performance evaluations indicate improved efficiency over previous hardware-based and compiler-based schemes.
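A minimal sketch of the read-buffer idea may help: the hardware saves the operand values read by the most recent instructions so that a rollback can restore any that were later overwritten. All names and the register-file interface below are hypothetical simplifications of the scheme described above.

```python
from collections import deque

class ReadBuffer:
    """Illustrative operand read buffer: the last `depth` register reads
    are logged so a rollback of up to `depth` instructions can restore
    operands clobbered by later writes (a rollback data hazard)."""
    def __init__(self, depth):
        self.log = deque(maxlen=depth)   # (reg, value) per retired read

    def record_read(self, reg, value):
        self.log.append((reg, value))

    def rollback(self, regfile, n):
        # Undo the n most recent instructions by restoring the operand
        # values they observed, enabling hazard-free replay.
        for _ in range(min(n, len(self.log))):
            reg, value = self.log.pop()
            regfile[reg] = value
```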
Compiler-Assisted Multiple Instruction Rollback Recovery Using a Read Buffer. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Alewine, Neal Jon
1993-01-01
Multiple instruction rollback (MIR) is a technique to provide rapid recovery from transient processor failures and was implemented in hardware by researchers and slow in mainframe computers. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs were also developed which remove rollback data hazards directly with data flow manipulations, thus eliminating the need for most data redundancy hardware. Compiler-assisted techniques to achieve multiple instruction rollback recovery are addressed. It is observed that data some hazards resulting from instruction rollback can be resolved more efficiently by providing hardware redundancy while others are resolved more efficiently with compiler transformations. A compiler-assisted multiple instruction rollback scheme is developed which combines hardware-implemented data redundancy with compiler-driven hazard removal transformations. Experimental performance evaluations were conducted which indicate improved efficiency over previous hardware-based and compiler-based schemes. Various enhancements to the compiler transformations and to the data redundancy hardware developed for the compiler-assisted MIR scheme are described and evaluated. The final topic deals with the application of compiler-assisted MIR techniques to aid in exception repair and branch repair in a speculative execution architecture.
Minho Won; Albalawi, Hassan; Xin Li; Thomas, Donald E
2014-01-01
This paper describes a low-power hardware implementation for movement decoding in a brain-computer interface. Our proposed hardware design is facilitated by two novel ideas: (i) an efficient feature extraction method based on reduced-resolution discrete cosine transform (DCT), and (ii) a new hardware architecture of dual look-up tables to perform the discrete cosine transform without explicit multiplication. The proposed hardware implementation has been validated for movement decoding of electrocorticography (ECoG) signals by using a Xilinx FPGA Zynq-7000 board. It achieves more than 56× energy reduction over a reference design using band-pass filters for feature extraction.
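Distributed arithmetic is one standard way to evaluate a fixed-coefficient transform like the DCT with lookups, shifts and adds only; the paper's dual-LUT architecture is a variant of this idea. The sketch below is a generic software model of the technique, not the authors' exact design.

```python
import math

def make_lut(coeffs):
    """For every bit pattern over the inputs, precompute the sum of the
    fixed basis coefficients that the pattern selects."""
    n = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (p >> j) & 1)
            for p in range(1 << n)]

def dct_coeff_da(x, lut, bits=8):
    """Dot product of non-negative integer samples with the LUT's
    coefficients using only table lookups and shift-adds."""
    acc = 0.0
    for k in range(bits):
        pattern = 0
        for j, xj in enumerate(x):
            pattern |= ((xj >> k) & 1) << j
        acc += lut[pattern] * (1 << k)   # the 2**k scaling is a wire shift
    return acc

# One DCT coefficient (u = 1) of a 4-point block, multiplier-free:
N, u = 4, 1
basis = [math.cos(math.pi * (2 * j + 1) * u / (2 * N)) for j in range(N)]
print(dct_coeff_da([10, 20, 30, 40], make_lut(basis)))
```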
NASA Astrophysics Data System (ADS)
Kelly, Jamie S.; Bowman, Hiroshi C.; Rao, Vittal S.; Pottinger, Hardy J.
1997-06-01
Implementation issues represent an unfamiliar challenge to most control engineers, and many techniques for controller design ignore these issues outright. Consequently, the design of controllers for smart structural systems usually proceeds without regard for their eventual implementation, thus resulting either in serious performance degradation or in hardware requirements that squander power, complicate integration, and drive up cost. The level of integration assumed by the Smart Patch further exacerbates these difficulties, and any design inefficiency may render the realization of a single-package sensor-controller-actuator system infeasible. The goal of this research is to automate the controller implementation process and to relieve the design engineer of implementation concerns like quantization, computational efficiency, and device selection. We specifically target Field Programmable Gate Arrays (FPGAs) as our hardware platform because these devices are highly flexible, power efficient, and reprogrammable. The current study develops an automated implementation sequence that minimizes hardware requirements while maintaining controller performance. Beginning with a state space representation of the controller, the sequence automatically generates a configuration bitstream for a suitable FPGA implementation. MATLAB functions optimize and simulate the control algorithm before translating it into the VHSIC hardware description language. These functions improve power efficiency and simplify integration in the final implementation by performing a linear transformation that renders the controller computationally friendly. The transformation favors sparse matrices in order to reduce multiply operations and the hardware necessary to support them; simultaneously, the remaining matrix elements take on values that minimize limit cycles and parameter sensitivity. The proposed controller design methodology is implemented on a simple cantilever beam test structure using FPGA hardware. The experimental closed loop response is compared with that of an automated FPGA controller implementation. Finally, we explore the integration of FPGA based controllers into a multi-chip module, which we believe represents the next step towards the realization of the Smart Patch.
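The computation being mapped to the FPGA is just the discrete-time state-space recursion, and the flow's key step is a similarity transformation that makes the matrices sparse before synthesis. A minimal model of both (generic sketch, not the authors' MATLAB code):

```python
import numpy as np

def controller_step(A, B, C, D, x, u):
    """One update of the controller x' = Ax + Bu, y = Cx + Du."""
    y = C @ x + D @ u
    x_next = A @ x + B @ u
    return x_next, y

def transform(A, B, C, T):
    """Similarity transform z = T x: dynamics are preserved while T can
    be chosen so T A T^-1 is sparse, cutting multiplies in hardware."""
    Ti = np.linalg.inv(T)
    return T @ A @ Ti, T @ B, C @ Ti
```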
Efficient architecture for spike sorting in reconfigurable hardware.
Hwang, Wen-Jyi; Lee, Wei-Hao; Lin, Shiow-Jyu; Lai, Sheng-Ying
2013-11-01
This paper presents a novel hardware architecture for fast spike sorting. The architecture is able to perform both feature extraction and clustering in hardware. The generalized Hebbian algorithm (GHA) and the fuzzy C-means (FCM) algorithm are used for feature extraction and clustering, respectively. The employment of GHA allows efficient computation of principal components for subsequent clustering operations. The FCM is able to achieve near-optimal clustering for spike sorting, and its performance is insensitive to the selection of initial cluster centers. The hardware implementations of GHA and FCM feature low area costs and high throughput. In the GHA architecture, the computations of different weight vectors share the same circuit to lower area costs. Moreover, in the FCM hardware implementation, the usual iterative operations for updating the membership matrix and cluster centroids are merged into a single updating process to avoid the large storage requirement. To show the effectiveness of the circuit, the proposed architecture is physically implemented in a field programmable gate array (FPGA) and embedded in a System-on-Chip (SOC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient spike sorting design attaining a high classification correct rate and high-speed computation.
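The GHA (Sanger's rule) update that the feature-extraction circuit realises can be stated compactly; the sketch below is the textbook form of the rule, not the pipelined hardware schedule.

```python
import numpy as np

def gha_update(W, x, lr=0.01):
    """One generalized Hebbian algorithm step: rows of W converge to the
    leading principal components of the input distribution."""
    y = W @ x
    for i in range(W.shape[0]):
        recon = W[: i + 1].T @ y[: i + 1]   # sum_{j<=i} y_j w_j
        W[i] += lr * y[i] * (x - recon)
    return W
```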
Hardware accelerator of convolution with exponential function for image processing applications
NASA Astrophysics Data System (ADS)
Panchenko, Ivan; Bucha, Victor
2015-12-01
In this paper we describe a Hardware Accelerator (HWA) for fast recursive approximation of separable convolution with an exponential function. This filter can be used in many Image Processing (IP) applications, e.g. depth-dependent image blur, image enhancement and disparity estimation. We have adapted the RTL implementation of this filter to provide maximum throughput within the constraints of the required memory bandwidth and hardware resources, yielding a power-efficient VLSI implementation.
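The recursion underlying such a filter is the standard pair of causal and anti-causal first-order passes, which replaces an explicit convolution with two multiply-adds per pixel per dimension. A one-dimensional sketch (generic scheme, not the HWA's exact RTL; separability means the same pass is applied to rows and then columns):

```python
def exp_filter_1d(row, alpha):
    """Recursive approximation of convolution with a two-sided
    exponential kernel along one row: a causal pass followed by an
    anti-causal pass, O(1) work per sample."""
    out = list(row)
    for i in range(1, len(out)):               # causal pass
        out[i] += alpha * (out[i - 1] - out[i])
    for i in range(len(out) - 2, -1, -1):      # anti-causal pass
        out[i] += alpha * (out[i + 1] - out[i])
    return out
```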
ELIPS: Toward a Sensor Fusion Processor on a Chip
NASA Technical Reports Server (NTRS)
Daud, Taher; Stoica, Adrian; Tyson, Thomas; Li, Wei-te; Fabunmi, James
1998-01-01
The paper presents the concept and initial tests from the hardware implementation of a low-power, high-speed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) processor is developed to seamlessly combine rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor data in compact low-power VLSI. The first demonstration of the ELIPS concept targets interceptor functionality; other applications, mainly in robotics and autonomous systems, are considered for the future. The main assumption behind ELIPS is that fuzzy, rule-based and neural forms of computation can serve as the main primitives of an "intelligent" processor. Thus, in the same way that classic processors are designed to optimize the hardware implementation of a set of fundamental operations, ELIPS is developed as an efficient implementation of computational intelligence primitives, and relies on a set of fuzzy set, fuzzy inference and neural modules, built in programmable analog hardware. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Following software demonstrations on several interceptor data sets, three important ELIPS building blocks (a fuzzy set preprocessor, a rule-based fuzzy system and a neural network) have been fabricated in analog VLSI hardware and have demonstrated microsecond processing times.
A hardware implementation of the discrete Pascal transform for image processing
NASA Astrophysics Data System (ADS)
Goodman, Thomas J.; Aburdene, Maurice F.
2006-02-01
The discrete Pascal transform is a polynomial transform with applications in pattern recognition, digital filtering, and digital image processing. It already has been shown that the Pascal transform matrix can be decomposed into a product of binary matrices. Such a factorization leads to a fast and efficient hardware implementation without the use of multipliers, which consume large amounts of hardware. We recently developed a field-programmable gate array (FPGA) implementation to compute the Pascal transform. Our goal was to demonstrate the computational efficiency of the transform while keeping hardware requirements at a minimum. Images are uploaded into memory from a remote computer prior to processing, and the transform coefficients can be offloaded from the FPGA board for analysis. Design techniques like as-soon-as-possible scheduling and adder sharing allowed us to develop a fast and efficient system. An eight-point, one-dimensional transform completes in 13 clock cycles and requires only four adders. An 8x8 two-dimensional transform completes in 240 cycles and requires only a top-level controller in addition to the one-dimensional transform hardware. Finally, through minor modifications to the controller, the transform operations can be pipelined to achieve 100% utilization of the four adders, allowing one eight-point transform to complete every seven clock cycles.
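The binary-matrix factorization translates directly into stages of neighbour additions, which is why no multipliers are needed. The sketch below shows the adder-only evaluation of the binomial (Pascal) matrix-vector product; the alternating signs of the full discrete Pascal transform are left out for brevity.

```python
def pascal_transform(x):
    """Adder-only evaluation matching the binary-matrix factorization:
    each stage adds a neighbour, so an N-point transform needs only
    additions. For x = [a, b, c] the result is [a, a+b, a+2b+c]."""
    y = list(x)
    n = len(y)
    for stage in range(1, n):
        for i in range(n - 1, stage - 1, -1):
            y[i] += y[i - 1]
    return y
```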
Extended Logic Intelligent Processing System for a Sensor Fusion Processor Hardware
NASA Technical Reports Server (NTRS)
Stoica, Adrian; Thomas, Tyson; Li, Wei-Te; Daud, Taher; Fabunmi, James
2000-01-01
The paper presents the hardware implementation and initial tests of a low-power, high-speed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) is described, which combines rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor signals in compact low-power VLSI. The ELIPS concept is being developed to demonstrate interceptor functionality, which particularly underlines the high-speed and low-power requirements. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Processing speeds of microseconds have been demonstrated using our test hardware.
Study of efficient video compression algorithms for space shuttle applications
NASA Technical Reports Server (NTRS)
Poo, Z.
1975-01-01
Results are presented of a study on video data compression techniques applicable to space flight communication. The study is directed towards monochrome (black and white) picture communication, with special emphasis on the feasibility of hardware implementation. The primary factors for such a communication system in space flight applications are: picture quality, system reliability, power consumption, and hardware weight. In terms of hardware implementation, these are directly related to hardware complexity, effectiveness of the hardware algorithm, immunity of the source code to channel noise, and data transmission rate (or transmission bandwidth). A system is recommended, and its hardware requirements are summarized. Simulations of the study were performed on the improved LIM video controller, which is computer-controlled by the META-4 CPU.
NASA Astrophysics Data System (ADS)
Azimi, Ehsan; Behrad, Alireza; Ghaznavi-Ghoushchi, Mohammad Bagher; Shanbehzadeh, Jamshid
2016-11-01
The projective model is an important mapping function for the calculation of the global transformation between two images. However, its hardware implementation is challenging because of the large number of coefficients with different required precisions for fixed-point representation. A VLSI hardware architecture is proposed for the calculation of a global projective model between input and reference images and for refining matches by rejecting false ones using the random sample consensus (RANSAC) algorithm. To make the hardware implementation feasible, it is proved that the calculation of the projective model can be divided into four submodels comprising two translations, an affine model and a simpler projective mapping. This approach makes the hardware implementation feasible and considerably reduces the required number of bits for fixed-point representation of model coefficients and intermediate variables. The proposed hardware architecture for the calculation of a global projective model using the RANSAC algorithm was implemented in the Verilog hardware description language, and the functionality of the design was validated through several experiments. The proposed architecture was synthesized using an application-specific integrated circuit digital design flow utilizing 180-nm CMOS technology as well as a Virtex-6 field programmable gate array. Experimental results confirm the efficiency of the proposed hardware architecture in comparison with a software implementation.
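For reference, the RANSAC loop that the architecture implements in hardware has this generic shape; the fit/error callbacks and parameters below are placeholders, since the paper's contribution is the fixed-point submodel decomposition rather than the loop itself.

```python
import random

def ransac(matches, fit, err, n_sample, thresh, iters=500):
    """Generic RANSAC: repeatedly fit a model to a minimal sample and
    keep the model with the largest consensus (inlier) set."""
    best_model, best_inliers = None, []
    for _ in range(iters):
        model = fit(random.sample(matches, n_sample))
        inliers = [m for m in matches if err(model, m) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return fit(best_inliers), best_inliers   # refit on all inliers
```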
Fast and Adaptive Lossless Onboard Hyperspectral Data Compression System
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh I.; Keymeulen, Didier; Kimesh, Matthew A.
2012-01-01
Modern hyperspectral imaging systems are able to acquire far more data than can be downlinked from a spacecraft. Onboard data compression helps to alleviate this problem, but requires a system capable of power efficiency and high throughput. Software solutions have limited throughput performance and are power-hungry. Dedicated hardware solutions can provide both high throughput and power efficiency, while taking the load off the main processor. Thus a hardware compression system was developed. The implementation uses a field-programmable gate array (FPGA). The implementation is based on the fast lossless (FL) compression algorithm reported in Fast Lossless Compression of Multispectral-Image Data (NPO-42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which achieves excellent compression performance and has low complexity. This algorithm performs predictive compression using an adaptive filtering method, and uses adaptive Golomb coding. The implementation also packetizes the coded data. The FL algorithm is well suited for implementation in hardware. In the FPGA implementation, one sample is compressed every clock cycle, which makes for a fast and practical real-time solution for space applications. Benefits of this implementation are: 1) The underlying algorithm achieves a combination of low complexity and compression effectiveness that exceeds that of techniques currently in use. 2) The algorithm requires no training data or other specific information about the nature of the spectral bands for a fixed instrument dynamic range. 3) Hardware acceleration provides a throughput improvement of 10 to 100 times vs. the software implementation. A prototype of the compressor is available in software, but it runs at a speed that does not meet spacecraft requirements. The hardware implementation targets Xilinx Virtex IV FPGAs, and makes the use of this compressor practical for Earth satellites as well as beyond-Earth missions with hyperspectral instruments.
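The coding stage pairs adaptive prediction with Golomb coding; a fixed-parameter Golomb-Rice codeword, the core of that family, looks like this (the FL algorithm adapts the parameter k per sample, which is not shown):

```python
def rice_encode(value, k):
    """Golomb-Rice codeword for a non-negative mapped prediction
    residual: unary-coded quotient, then k-bit binary remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, "0{}b".format(k)) if k else "")

# Example: residual 9 with k = 2 -> quotient 2, remainder 1 -> "110" + "01"
print(rice_encode(9, 2))  # "11001"
```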
Event-driven processing for hardware-efficient neural spike sorting
NASA Astrophysics Data System (ADS)
Liu, Yan; Pereira, João L.; Constandinou, Timothy G.
2018-02-01
Objective. The prospect of real-time and on-node spike sorting provides a genuine opportunity to push the envelope of large-scale integrated neural recording systems. In such systems the hardware resources, power requirements and data bandwidth increase linearly with channel count. Event-based (or data-driven) processing can provide a new and efficient means for hardware implementation that is completely activity-dependent. In this work, we investigate using continuous-time level-crossing sampling for efficient data representation and subsequent spike processing. Approach. (1) We first compare signals (synthetic neural datasets) encoded with this technique against conventional sampling. (2) We then show how such a representation can be directly exploited by extracting simple time-domain features from the bitstream to perform neural spike sorting. (3) The proposed method is implemented on a low-power FPGA platform to demonstrate its hardware viability. Main results. It is observed that considerably lower data rates are achievable when using 7 bits or less to represent the signals, whilst maintaining signal fidelity. Results obtained using both MATLAB and reconfigurable logic hardware (FPGA) indicate that feature extraction and spike sorting can be achieved with comparable or better accuracy than reference methods whilst also requiring relatively low hardware resources. Significance. By effectively exploiting continuous-time data representation, neural signal processing can be achieved in a completely event-driven manner, reducing both the required resources (memory, complexity) and computations (operations). This will see future large-scale neural systems integrating on-node processing in real-time hardware.
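A discretised model of level-crossing sampling makes the event-driven representation concrete: samples are replaced by (time, direction) events emitted whenever the signal moves one level. This is an illustrative sketch, not the paper's analog front end.

```python
def level_crossings(samples, delta):
    """Level-crossing encoder: emit a (sample index, +/-1) event each
    time the signal moves one level of size delta. Quiet signals emit
    nothing, making the data rate activity-dependent."""
    events, ref = [], samples[0]
    for i, s in enumerate(samples[1:], 1):
        while s - ref >= delta:
            ref += delta
            events.append((i, +1))
        while ref - s >= delta:
            ref -= delta
            events.append((i, -1))
    return events
```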
Compiler-assisted multiple instruction rollback recovery using a read buffer
NASA Technical Reports Server (NTRS)
Alewine, Neal J.; Chen, Shyh-Kwei; Fuchs, W. Kent; Hwu, Wen-Mei W.
1995-01-01
Multiple instruction rollback (MIR) is a technique that has been implemented in mainframe computers to provide rapid recovery from transient processor failures. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs have also been developed which remove rollback data hazards directly with data-flow transformations. This paper describes compiler-assisted techniques to achieve multiple instruction rollback recovery. We observe that some data hazards resulting from instruction rollback can be resolved efficiently by providing an operand read buffer while others are resolved more efficiently with compiler transformations. The compiler-assisted scheme presented consists of hardware that is less complex than shadow files, history files, history buffers, or delayed write buffers, while experimental evaluation indicates performance improvement over compiler-based schemes.
Rodríguez, Manuel; Magdaleno, Eduardo; Pérez, Fernando; García, Cristhian
2017-03-28
Non-equispaced fast Fourier transform (NFFT) is a very important algorithm in several technological and scientific areas such as synthetic aperture radar, computational photography, medical imaging, telecommunications, seismic analysis and so on. However, its computational complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that employs an Advanced RISC Machine (ARM) as the Processing System, with Programmable Logic for high-performance digital signal processing through parallelism and pipelining. The algorithm has been coded in C with pragma directives to optimize the architecture of the system. We have used the recent Software-Defined System-on-Chip (SDSoC) development tool, which simplifies the interface and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results show that hardware acceleration significantly outperformed the software-based implementation.
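The NFFT pipeline the coprocessor accelerates is, in essence, window-based gridding followed by an ordinary FFT. A one-dimensional adjoint-transform sketch (Gaussian window and parameters are illustrative choices, not the paper's fixed-point architecture):

```python
import numpy as np

def nfft_adjoint_1d(x, f, n_grid, m=4):
    """Spread each non-equispaced sample (x in [0, 1), value f) onto a
    uniform grid with a small window, then take a standard FFT."""
    sigma = 0.5 * m / np.pi
    grid = np.zeros(n_grid, dtype=complex)
    for xj, fj in zip(x, f):
        c = xj * n_grid
        for u in range(int(c) - m, int(c) + m + 1):
            w = np.exp(-((c - u) ** 2) / (2 * sigma ** 2))
            grid[u % n_grid] += fj * w       # convolve onto the grid
    return np.fft.fft(grid)
```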
FPGA implementation of sparse matrix algorithm for information retrieval
NASA Astrophysics Data System (ADS)
Bojanic, Slobodan; Jevtic, Ruzica; Nieto-Taladriz, Octavio
2005-06-01
Information text data retrieval requires a tremendous amount of processing time because of the size of the data and the complexity of information retrieval algorithms. In this paper a solution to this problem is proposed via hardware-supported information retrieval algorithms. Reconfigurable computing can adopt frequent hardware modifications through its tailorable hardware and exploits parallelism for a given application through reconfigurable and flexible hardware units. The degree of parallelism can be tuned for the data. In this work we implemented the standard BLAS (basic linear algebra subprogram) sparse matrix algorithm named Compressed Sparse Row (CSR), which has been shown to be more efficient in terms of storage space requirements and query-processing time than other sparse matrix algorithms for information retrieval applications. Although the inverted index has been treated as the de facto standard for information retrieval for years, an alternative approach that stores the index of a text collection in a sparse matrix structure has been gaining attention. This approach performs query processing using sparse matrix-vector multiplication and, due to parallelization, achieves substantial efficiency gains over the sequential inverted index. The parallel implementations of the information retrieval kernel presented in this work target a Virtex II Field Programmable Gate Array (FPGA) board from Xilinx. A recent development in scientific applications is the use of FPGAs to achieve high-performance results. Computational results are compared to implementations on other platforms. The design achieves a high level of parallelism for the overall function while retaining highly optimised hardware within each processing unit.
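The CSR kernel at the heart of the design is small enough to state exactly; the FPGA parallelises many of these row dot-products, whereas the reference form below is sequential.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compressed Sparse Row matrix-vector product: nonzeros of row i
    live in values[row_ptr[i]:row_ptr[i+1]], with matching columns in
    col_idx. This is the query-processing kernel described above."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```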
NASA Astrophysics Data System (ADS)
Sun, Wenhao; Cai, Xudong; Meng, Qiao
2016-04-01
Complex automatic protection functions are being added to the onboard software of the Alpha Magnetic Spectrometer. A hardware-in-the-loop simulation method has been introduced to overcome the difficulties of ground testing imposed by hardware and environmental limitations. We invented a time-saving approach that reuses flight data as the data source of the simulation system instead of mathematical models. It is easy to implement and works efficiently. This paper presents the system framework, implementation details and some application examples.
A cellular automata based FPGA realization of a new metaheuristic bat-inspired algorithm
NASA Astrophysics Data System (ADS)
Progias, Pavlos; Amanatiadis, Angelos A.; Spataro, William; Trunfio, Giuseppe A.; Sirakoulis, Georgios Ch.
2016-10-01
Optimization algorithms are often inspired by processes occurring in nature, such as animal behavioral patterns. The main concern with implementing such algorithms in software is the large amount of processing power they require. In contrast to software code, which can only perform calculations in a serial manner, an implementation in hardware, exploiting the inherent parallelism of single-purpose processors, can prove to be much more efficient both in speed and energy consumption. Furthermore, the use of Cellular Automata (CA) in such an implementation is efficient both as a model for natural processes and as a computational paradigm that maps well onto hardware. In this paper, we propose a VHDL implementation of a metaheuristic algorithm inspired by the echolocation behavior of bats. More specifically, the CA model is inspired by the metaheuristic algorithm proposed earlier in the literature, which can be considered at least as efficient as other existing optimization algorithms. The function of the FPGA implementation of our algorithm is explained in full detail, and results of our simulations are also demonstrated.
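The core echolocation-inspired update that such an implementation parallelises per bat (or per CA cell) is shown below; loudness and pulse-emission updates of the full metaheuristic are omitted, and the frequency range is an illustrative choice.

```python
import random

def bat_step(pos, vel, best, f_min=0.0, f_max=2.0):
    """One velocity/position update for a single bat: a random frequency
    scales the pull toward the current global best position."""
    freq = f_min + (f_max - f_min) * random.random()
    vel = [v + (p - b) * freq for v, p, b in zip(vel, pos, best)]
    pos = [p + v for p, v in zip(pos, vel)]
    return pos, vel
```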
Narasimhan, Seetharam; Chiel, Hillel J; Bhunia, Swarup
2009-01-01
For implantable neural interface applications, it is important to compress data and analyze spike patterns across multiple channels in real time. Such a computational task for online neural data processing requires an innovative circuit-architecture-level design approach for a low-power, robust and area-efficient hardware implementation. Conventional microprocessor or Digital Signal Processing (DSP) chips would dissipate too much power and be too large for an implantable system. In this paper, we propose a novel hardware design approach, referred to as "Preferential Design", that exploits the nature of the neural signal processing algorithm to achieve a low-voltage, robust and area-efficient implementation using nanoscale process technology. The basic idea is to isolate the components critical to system performance and design them more conservatively than the noncritical ones. This allows aggressive voltage scaling for low-power operation while ensuring robustness and area efficiency. We have applied the proposed approach to a neural signal processing algorithm using the Discrete Wavelet Transform (DWT) and observed significant improvement in power and robustness over conventional design.
Reconfigurable Hardware for Compressing Hyperspectral Image Data
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh; Namkung, Jeffrey; Villapando, Carlos; Kiely, Aaron; Klimesh, Matthew; Xie, Hua
2010-01-01
High-speed, low-power, reconfigurable electronic hardware has been developed to implement ICER-3D, an algorithm for compressing hyperspectral-image data. The algorithm and parts thereof have been the topics of several NASA Tech Briefs articles, including Context Modeler for Wavelet Compression of Hyperspectral Images (NPO-43239) and ICER-3D Hyperspectral Image Compression Software (NPO-43238), which appear elsewhere in this issue of NASA Tech Briefs. As described in more detail in those articles, the algorithm includes three main subalgorithms: one for computing wavelet transforms, one for context modeling, and one for entropy encoding. For the purpose of designing the hardware, these subalgorithms are treated as modules to be implemented efficiently in field-programmable gate arrays (FPGAs). The design takes advantage of industry-standard, commercially available FPGAs. The implementation targets the Xilinx Virtex II Pro architecture, which has embedded PowerPC processor cores with flexible on-chip bus architecture. It incorporates an efficient parallel and pipelined architecture to compress the three-dimensional image data. The design provides for internal buffering to minimize intensive input/output operations while making efficient use of off-chip memory. The design is scalable in that the subalgorithms are implemented as independent hardware modules that can be combined in parallel to increase throughput. The on-chip processor manages the overall operation of the compression system, including execution of the top-level control functions as well as scheduling, initiating, and monitoring processes. The design prototype has been demonstrated to be capable of compressing hyperspectral data at a rate of 4.5 megasamples per second at a conservative clock frequency of 50 MHz, with a potential for substantially greater throughput at a higher clock frequency. The power consumption of the prototype is less than 6.5 W. The reconfigurability (by means of reprogramming) of the FPGAs makes it possible to effectively alter the design to some extent to satisfy different requirements without adding hardware. The implementation could be easily propagated to future FPGA generations and/or to custom application-specific integrated circuits.
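As a concrete stand-in for the wavelet module, one level of a separable 2-D transform looks like this (Haar filters used for brevity; ICER-3D's actual filters and three-dimensional extension differ):

```python
import numpy as np

def haar2d(img):
    """One level of a separable 2-D wavelet transform: rows, then
    columns, are split into low-pass and high-pass halves."""
    def split(a):
        lo = (a[..., 0::2] + a[..., 1::2]) / 2.0
        hi = (a[..., 0::2] - a[..., 1::2]) / 2.0
        return np.concatenate([lo, hi], axis=-1)
    rows = split(img.astype(float))
    return split(rows.swapaxes(0, 1)).swapaxes(0, 1)
```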
Yang, Changju; Kim, Hyongsuk; Adhikari, Shyam Prasad; Chua, Leon O.
2016-01-01
A hybrid learning method combining software-based backpropagation learning and hardware-based RWC learning is proposed for the development of circuit-based neural networks. Backpropagation is one of the most efficient learning algorithms, but its weak point is that its hardware implementation is extremely difficult. The RWC algorithm, which is very easy to implement in hardware circuits, takes too many iterations for learning. The proposed learning algorithm is a hybrid of the two: the main learning is first performed with a software version of the BP algorithm, and the learned weights are then transplanted onto a hardware version of the neural circuit. At the time of the weight transplantation, a significant amount of output error can occur due to the characteristic differences between the software and the hardware. In the proposed method, such error is reduced via complementary learning with the RWC algorithm, which is implemented in simple hardware. The usefulness of the proposed hybrid learning system is verified via simulations on several classical learning problems. PMID:28025566
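A minimal sketch of the complementary stage: after backpropagation has been run in software and the weights transplanted, a random-weight-change loop nudges the hardware weights to absorb the software/hardware mismatch. The perturb-and-keep-if-better rule below is a simplified form of RWC.

```python
import random

def rwc_refine(weights, error_fn, step=0.01, iters=1000):
    """Random weight change refinement: perturb all weights by a random
    +/-step and keep the perturbation only if the measured error drops.
    `error_fn` stands in for the hardware's error measurement."""
    err = error_fn(weights)
    for _ in range(iters):
        delta = [step * random.choice((-1.0, 1.0)) for _ in weights]
        trial = [w + d for w, d in zip(weights, delta)]
        new_err = error_fn(trial)
        if new_err < err:
            weights, err = trial, new_err
    return weights
```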
An Application Development Platform for Neuromorphic Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dean, Mark; Chan, Jason; Daffron, Christopher
2016-01-01
Dynamic Adaptive Neural Network Arrays (DANNAs) are neuromorphic computing systems developed as a hardware-based approach to the implementation of neural networks. They feature highly adaptive and programmable structural elements, which model artificial neural networks with spiking behavior. We design them to solve problems using evolutionary optimization. In this paper, we highlight the current hardware and software implementations of DANNA, including their features, functionalities and performance. We then describe the development of an Application Development Platform (ADP) to support efficient application implementation and testing of DANNA-based solutions. We conclude with future directions.
Binary Associative Memories as a Benchmark for Spiking Neuromorphic Hardware
Stöckel, Andreas; Jenzen, Christoph; Thies, Michael; Rückert, Ulrich
2017-01-01
Large-scale neuromorphic hardware platforms, specialized computer systems for energy-efficient simulation of spiking neural networks, are being developed around the world, for example as part of the European Human Brain Project (HBP). Due to conceptual differences, a universal performance analysis of these systems in terms of runtime, accuracy and energy efficiency is non-trivial, yet indispensable for further hardware and software development. In this paper we describe a scalable benchmark based on a spiking neural network implementation of the binary neural associative memory. We treat neuromorphic hardware and software simulators as black boxes and execute exactly the same network description across all devices. Experiments on the HBP platforms under varying configurations of the associative memory show that the presented method makes it possible to test the quality of the neuron model implementation, and to explain significant deviations from the expected reference output. PMID:28878642
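The benchmark network itself is the classic clipped-Hebbian binary associative memory, which is easy to state as a reference model (a non-spiking software form; the paper maps it onto spiking neurons):

```python
import numpy as np

def store(patterns):
    """Willshaw-style binary associative memory: clipped Hebbian storage
    of (x, y) pairs of 0/1 vectors in a binary weight matrix."""
    W = np.zeros((patterns[0][1].size, patterns[0][0].size), dtype=np.uint8)
    for x, y in patterns:
        W |= np.outer(y, x).astype(np.uint8)
    return W

def recall(W, x, theta):
    # A unit fires if its summed input reaches threshold theta,
    # typically the number of ones in the cue x.
    return (W @ x >= theta).astype(np.uint8)
```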
Sun, Yuwen; Cheng, Allen C
2012-07-01
Artificial neural networks (ANNs) are a promising machine learning technique for classifying non-linear electrocardiogram (ECG) signals and recognizing abnormal patterns suggesting risks of cardiovascular diseases (CVDs). In this paper, we propose a new reusable neuron architecture (RNA) enabling a performance-efficient and cost-effective silicon implementation for ANNs. The RNA architecture consists of a single layer of physical RNA neurons, each of which is designed to use minimal hardware resources (e.g., a single 2-input multiplier-accumulator is used to compute the dot product of two vectors). By carefully applying the principle of time-sharing, RNA multiplexes this single layer of physical neurons to efficiently execute both feed-forward and back-propagation computations of an ANN while conserving area and reducing the power dissipation of the silicon. A three-layer 51-30-12 ANN is implemented in RNA to perform ECG classification for CVD detection. The RNA hardware also allows on-chip automatic training updates. A quantitative design space exploration in area, power dissipation, and execution speed between RNA and three other implementations representative of different reusable hardware strategies is presented and discussed. Compared with an equivalent software implementation in C executed on an embedded microprocessor, the RNA ASIC achieves three orders of magnitude improvement in both execution speed and energy efficiency.
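The time-sharing idea reduces, in software terms, to serving every logical neuron from one multiply-accumulate unit; the loop below mirrors that schedule for a feed-forward pass (illustrative, not the RNA ASIC's exact control).

```python
def layer_forward(inputs, weights, act):
    """Feed-forward pass expressed as one multiply-accumulate at a time,
    mirroring how a single physical MAC is time-shared across all
    logical neurons in a layer."""
    outputs = []
    for w_row in weights:            # logical neurons, served in turn
        acc = 0.0
        for x, w in zip(inputs, w_row):
            acc += x * w             # the one shared 2-input MAC
        outputs.append(act(acc))
    return outputs
```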
Hardware realization of an SVM algorithm implemented in FPGAs
NASA Astrophysics Data System (ADS)
Wiśniewski, Remigiusz; Bazydło, Grzegorz; Szcześniak, Paweł
2017-08-01
The paper proposes a technique for the hardware realization of space vector modulation (SVM) of the switching state function in a matrix converter (MC), oriented towards implementation in a single field programmable gate array (FPGA). In an MC the SVM method is based on the instantaneous space-vector representation of input currents and output voltages. Traditional computation algorithms usually involve digital signal processors (DSPs), which must drive a large number of power transistors (18 transistors and 18 independent PWM outputs) and handle "non-standard positions of control pulses" during the switching sequence. Recently, hardware implementations have become popular, since operations can be executed much faster and more efficiently owing to the nature of digital devices (especially their concurrency). In this paper, we propose a hardware algorithm for SVM computation. In contrast to existing techniques, the presented solution applies the COordinate Rotation DIgital Computer (CORDIC) method to solve the trigonometric operations. Furthermore, adequate arithmetic modules (that is, sub-devices) used for intermediate calculations, such as code converters or proper sector selectors (for output voltages and input currents), are presented in detail. The proposed technique has been implemented as a design described in the Verilog hardware description language. Preliminary results of the logic implementation oriented on Xilinx FPGAs (particularly, a low-cost device from the Xilinx Artix-7 family) are also presented.
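CORDIC replaces the trigonometric evaluations with shift-and-add iterations; the rotation-mode form used for sine/cosine is sketched below in floating point (the hardware works in fixed point, and the input angle must lie within CORDIC's ±≈1.74 rad convergence range).

```python
import math

def cordic_sin_cos(angle, n_iter=16):
    """Rotation-mode CORDIC: computes (cos, sin) using only shifts,
    adds and a small arctangent table."""
    K = 1.0
    for i in range(n_iter):                      # scaling constant
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, angle
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0              # rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x, y                                  # (cos angle, sin angle)
```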
An FPGA-Based People Detection System
NASA Astrophysics Data System (ADS)
Nair, Vinod; Laprise, Pierre-Olivier; Clark, James J.
2005-12-01
This paper presents an FPGA-based system for detecting people from video. The system is designed to use JPEG-compressed frames from a network camera. Unlike previous approaches that use techniques such as background subtraction and motion detection, we use a machine-learning-based approach to train an accurate detector. We address the hardware design challenges involved in implementing such a detector, along with JPEG decompression, on an FPGA. We also present an algorithm that efficiently combines JPEG decompression with the detection process. This algorithm carries out the inverse DCT step of JPEG decompression only partially. Therefore, it is computationally more efficient and simpler to implement, and it takes up less space on the chip than the full inverse DCT algorithm. The system is demonstrated on an automated video surveillance application and the performance of both hardware and software implementations is analyzed. The results show that the system can detect people accurately at a rate of about [InlineEquation not available: see full text] frames per second on a Virtex-II 2V1000 using a MicroBlaze processor running at [InlineEquation not available: see full text], communicating with dedicated hardware over FSL links.
Paraskevopoulou, Sivylla E; Barsakcioglu, Deren Y; Saberi, Mohammed R; Eftekhar, Amir; Constandinou, Timothy G
2013-04-30
Next-generation neural interfaces aspire to achieve real-time multi-channel systems by integrating spike sorting on chip to overcome limitations in communication channel capacity. The feasibility of this approach relies on developing highly efficient algorithms for feature extraction and clustering with the potential for low-power hardware implementation. We propose a feature extraction method, not requiring any calibration, based on first and second derivative features of the spike waveform. The accuracy and computational complexity of the proposed method are quantified and compared against commonly used feature extraction methods, through simulation across four datasets (with different single units) at multiple noise levels (ranging from 5 to 20% of the signal amplitude). The average classification error is shown to be below 7% with a computational complexity of 2N-3, where N is the number of sample points of each spike. Overall, this method presents a good trade-off between accuracy and computational complexity and is thus particularly well-suited for hardware-efficient implementation.
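The feature set can be written down directly, and the stated 2N-3 complexity falls out of the two difference passes; the choice of extrema as the final features below is one straightforward variant of the derivative-based method.

```python
def derivative_features(spike):
    """First and second derivative features of an N-sample waveform:
    (N-1) + (N-2) = 2N-3 subtractions in total, no calibration."""
    d1 = [spike[i + 1] - spike[i] for i in range(len(spike) - 1)]   # N-1
    d2 = [d1[i + 1] - d1[i] for i in range(len(d1) - 1)]            # N-2
    return max(d1), min(d1), max(d2), min(d2)
```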
Efficient k-Winner-Take-All Competitive Learning Hardware Architecture for On-Chip Learning
Ou, Chien-Min; Li, Hui-Ya; Hwang, Wen-Jyi
2012-01-01
A novel k-winners-take-all (k-WTA) competitive learning (CL) hardware architecture is presented for on-chip learning in this paper. The architecture is based on an efficient pipeline allowing k-WTA competition processes associated with different training vectors to be performed concurrently. The pipeline architecture employs a novel codeword swapping scheme so that neurons failing the competition for a training vector are immediately available for the competitions for subsequent training vectors. The architecture is implemented in a field programmable gate array (FPGA) and used as a hardware accelerator in a system on programmable chip (SOPC) for real-time on-chip learning. Experimental results show that the SOPC has significantly lower training time than other k-WTA CL counterparts operating with or without hardware support.
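The competition rule being pipelined is the standard k-WTA update: the k nearest codewords move toward each training vector. The reference form below omits the codeword-swapping schedule that gives the hardware its concurrency.

```python
import numpy as np

def kwta_update(codebook, x, k, lr=0.05):
    """k-winners-take-all competitive learning step: the k codewords
    closest to training vector x win and are pulled toward it."""
    d = np.linalg.norm(codebook - x, axis=1)
    for i in np.argsort(d)[:k]:          # indices of the k winners
        codebook[i] += lr * (x - codebook[i])
    return codebook
```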
A high throughput architecture for a low complexity soft-output demapping algorithm
NASA Astrophysics Data System (ADS)
Ali, I.; Wasenmüller, U.; Wehn, N.
2015-11-01
Iterative channel decoders such as Turbo-Code and LDPC decoders show exceptional performance and are therefore part of many wireless communication receivers nowadays. These decoders require a soft input, i.e., the logarithmic likelihood ratio (LLR) of the received bits, with a typical quantization of 4 to 6 bits. For computing the LLR values from a received complex symbol, a soft demapper is employed in the receiver. The implementation cost of traditional soft-output demapping methods is relatively large in high-order modulation systems, and therefore low-complexity demapping algorithms are indispensable in low-power receivers. In the presence of multiple wireless communication standards, where each standard defines multiple modulation schemes, there is a need for an efficient demapper architecture covering all the flexibility requirements of these standards. Another challenge associated with hardware implementation of the demapper is achieving very high throughput in doubly iterative systems, for instance, MIMO and code-aided synchronization. In this paper, we present a comprehensive communication and hardware performance evaluation of low-complexity soft-output demapping algorithms to select the best algorithm for implementation. The main goal of this work is to design a high-throughput, flexible, and area-efficient architecture. We describe architectures to execute the investigated algorithms and implement them on an FPGA device to evaluate their hardware performance. The work has resulted in a hardware architecture, based on the best low-complexity algorithm identified, delivering a high throughput of 166 Msymbols/second for Gray-mapped 16-QAM modulation on a Virtex-5. This efficient architecture occupies only 127 slice registers, 248 slice LUTs and 2 DSP48Es.
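For orientation, the max-log approximation that low-complexity demappers simplify is shown below in brute-force form; with Gray mapping it collapses to per-axis piecewise-linear expressions, which is what efficient hardware exploits.

```python
def maxlog_llr(y, bit, symbols, labels, noise_var):
    """Max-log soft-output demapping for one bit position: difference of
    the nearest-symbol squared distances over the two bit hypotheses.
    `symbols` are constellation points, `labels` their integer bit
    labels; both are assumptions of this reference form."""
    d0 = min(abs(y - s) ** 2 for s, b in zip(symbols, labels)
             if not (b >> bit) & 1)
    d1 = min(abs(y - s) ** 2 for s, b in zip(symbols, labels)
             if (b >> bit) & 1)
    return (d0 - d1) / noise_var
```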
Efficient FIR Filter Implementations for Multichannel BCIs Using Xilinx System Generator.
Ghani, Usman; Wasim, Muhammad; Khan, Umar Shahbaz; Mubasher Saleem, Muhammad; Hassan, Ali; Rashid, Nasir; Islam Tiwana, Mohsin; Hamza, Amir; Kashif, Amir
2018-01-01
Background. Brain-computer interface (BCI) is a combination of software and hardware communication protocols that allow the brain to control external devices. The main purpose of BCI-controlled external devices is to provide a communication medium for disabled persons; these devices are now also considered a new way to rehabilitate patients with impairments. Certain potentials present in the electroencephalogram (EEG) correspond to specific events. The main issue is to detect such event-related potentials online at such a low signal-to-noise ratio (SNR). In this paper we propose a method that facilitates the concept of online processing by providing an efficient filtering implementation in a hardware-friendly environment by switching to finite impulse response (FIR) filters. The main focus of this research is to minimize the latency and computational delay of preprocessing for any BCI application. Four different FIR implementations, along with a large Laplacian filter, are implemented in Xilinx System Generator. An efficiency of 25% is achieved in terms of a reduced number of coefficients and multiplications, which in turn reduces computational delays accordingly.
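A linear-phase FIR filter's coefficient symmetry lets taps be pre-added in pairs, halving the multiplier count, which is the kind of restructuring behind the stated efficiency gain. A generic sketch (even-length symmetric filter; the System Generator designs will differ in detail):

```python
def symmetric_fir(x, h_half):
    """Direct-form FIR exploiting linear-phase symmetry: with taps
    h[i] == h[N-1-i], samples are pre-added in pairs so an N-tap filter
    needs only N/2 multiplies per output."""
    n_taps = 2 * len(h_half)
    xp = [0.0] * (n_taps - 1) + list(x)     # zero history
    y = []
    for n in range(len(xp) - n_taps + 1):
        win = xp[n : n + n_taps]
        y.append(sum(h * (win[i] + win[n_taps - 1 - i])
                     for i, h in enumerate(h_half)))
    return y
```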
NASA Astrophysics Data System (ADS)
Yang, Shuangming; Wei, Xile; Deng, Bin; Liu, Chen; Li, Huiyan; Wang, Jiang
2018-03-01
Balance between the biological plausibility of dynamical activities and computational efficiency is one of the challenging problems in computational neuroscience and neural system engineering. This paper proposes a set of efficient methods for the hardware realization of the conductance-based neuron model with relevant dynamics, targeting the reproduction of biological behaviors with low-cost implementation on a digital programmable platform; the methods can be applied to a wide range of conductance-based neuron models. Modified GP neuron models for efficient hardware implementation are presented to reproduce reliable pallidal dynamics, which decode the information of the basal ganglia and regulate movement-disorder-related voluntary activities. Implementation results on a field-programmable gate array (FPGA) demonstrate that the proposed techniques and models can reduce the resource cost significantly and reproduce the biological dynamics accurately. Besides, biological behaviors with weak network coupling are explored on the proposed platform, and a theoretical analysis is made to investigate the biological characteristics of the structured pallidal oscillator and network. The implementation techniques provide an essential step towards large-scale neural networks that can explore dynamical mechanisms in real time. Furthermore, the proposed methodology makes the FPGA-based system a powerful platform for the investigation of neurodegenerative diseases and the real-time control of bio-inspired neuro-robotics.
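The per-timestep computation such a platform repeats is a forward-Euler step of the conductance-based membrane equation; the generic Hodgkin-Huxley-style form below (gating-variable ODEs omitted) is only a reference for the costly terms that the paper's modified GP models approximate with hardware-friendly operations.

```python
def membrane_step(v, m, h, n, i_ext, dt, p):
    """Forward-Euler update of a generic conductance-based membrane:
    C dV/dt = I_ext - g_Na m^3 h (V-E_Na) - g_K n^4 (V-E_K) - g_L (V-E_L).
    The parameter dict p and the gating values are inputs; the gating
    ODEs would be integrated analogously each step."""
    i_na = p["gNa"] * m ** 3 * h * (v - p["ENa"])
    i_k = p["gK"] * n ** 4 * (v - p["EK"])
    i_l = p["gL"] * (v - p["EL"])
    return v + dt * (i_ext - i_na - i_k - i_l) / p["C"]
```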
Design and implementation of encrypted and decrypted file system based on USBKey and hardware code
NASA Astrophysics Data System (ADS)
Wu, Kehe; Zhang, Yakun; Cui, Wenchao; Jiang, Ting
2017-05-01
To protect the privacy of sensitive data, an encrypted and decrypted file system based on a USBKey and hardware code is designed and implemented in this paper. The system uses the USBKey and hardware code to authenticate a user. We use a random key to encrypt the file with a symmetric encryption algorithm, and the USBKey to encrypt the random key with an asymmetric encryption algorithm. At the same time, we use the MD5 algorithm to calculate the hash of the file to verify its integrity. Experimental results show that large files can be encrypted and decrypted in a very short time. The system has high efficiency and ensures the security of documents.
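The overall shape of the scheme is conventional hybrid encryption plus a digest. In the sketch below, `key_encrypt` is a placeholder for the USBKey's asymmetric operation, and a SHA-256 counter keystream stands in for the real symmetric cipher, so this is a structural illustration only.

```python
import hashlib
import os

def encrypt_file(data: bytes, key_encrypt):
    """Hybrid scheme: fresh random key encrypts the body, the asymmetric
    operation (key_encrypt, hypothetical callable) protects that key,
    and an MD5 digest supports integrity checking."""
    file_key = os.urandom(32)
    stream, counter = b"", 0
    while len(stream) < len(data):        # stand-in keystream cipher
        stream += hashlib.sha256(file_key +
                                 counter.to_bytes(8, "big")).digest()
        counter += 1
    body = bytes(a ^ b for a, b in zip(data, stream))
    return body, key_encrypt(file_key), hashlib.md5(data).hexdigest()
```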
NASA Astrophysics Data System (ADS)
Zhang, Yuli; Han, Jun; Weng, Xinqian; He, Zhongzhu; Zeng, Xiaoyang
This paper presents an Application Specific Instruction-set Processor (ASIP) for the SHA-3 BLAKE algorithm family, built by instruction set extension (ISE) of a RISC (reduced instruction set computer) processor. Through a design space exploration for this ASIP to increase performance and reduce area cost, we accomplish an efficient hardware and software implementation of the BLAKE algorithm. The special instructions and their well-matched hardware function unit improve the calculation of the key section of the algorithm, namely the G-functions. Also, relaxing the time constraint of the special function unit decreases its hardware cost while keeping the high data throughput of the processor. Evaluation results reveal the ASIP achieves 335 Mbps and 176 Mbps for BLAKE-256 and BLAKE-512, respectively. The extra area cost is only 8.06k equivalent gates. The proposed ASIP outperforms several software approaches on various platforms in cycles per byte. In fact, both the high throughput and the low hardware cost achieved by this programmable processor are comparable to those of ASIC implementations.
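The G-function that the custom instructions accelerate is a short ARX sequence; for BLAKE-256 it uses rotations by 16, 12, 8 and 7. The sketch below leaves the round-dependent message/constant selection as inputs.

```python
M32 = 0xFFFFFFFF

def ror(x, r):
    return ((x >> r) | (x << (32 - r))) & M32

def g(a, b, c, d, mx, my):
    """BLAKE-256 G-function core; mx and my are the message-word XOR
    constant-word injections chosen by the round's permutation, which
    is omitted here."""
    a = (a + b + mx) & M32; d = ror(d ^ a, 16)
    c = (c + d) & M32;      b = ror(b ^ c, 12)
    a = (a + b + my) & M32; d = ror(d ^ a, 8)
    c = (c + d) & M32;      b = ror(b ^ c, 7)
    return a, b, c, d
```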
Bandwidth Efficient Wireless Digital Modem Developed
NASA Technical Reports Server (NTRS)
Kifle, Muli
1999-01-01
NASA Lewis Research Center has developed a digital approach for broadcasting high-fidelity audio (nearly compact disk (CD) quality sound) in the commercial frequency-modulated (FM) broadcast band. This digital approach provides a means of achieving high data transmission rates with low hardware complexity--including low mass, size, and power consumption. Lewis has completed the design and prototype development of a bandwidth-efficient digital modem (modulator and demodulator) that uses a spectrally efficient modulation scheme: 16-ary rectangular quadrature amplitude modulation, or 16-ary QAM. The digital implementation is based strictly on inexpensive, commercial off-the-shelf digital signal processing (DSP) hardware to perform up and down conversions and pulse shaping. The digital modem transmits data at rates up to 76 kilobits per second (kbps), which is almost 3 times faster than standard 28.8-kbps telephone modems. In addition, the modem offers improved power and spectral performance, flexible operation, and low-cost implementation.
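For concreteness, a Gray-mapped 16-QAM symbol mapping takes two bits per axis so that neighbouring constellation points differ in a single bit; the labelling below is one common choice and may not match the modem's exact mapping.

```python
def qam16_map(bits):
    """Map four bits to a Gray-labelled 16-QAM constellation point:
    bits 0-1 select the in-phase level, bits 2-3 the quadrature level."""
    gray = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}
    return complex(gray[(bits[0], bits[1])], gray[(bits[2], bits[3])])

print(qam16_map([1, 0, 0, 1]))  # (3-1j)
```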
CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment
Manavski, Svetlin A; Valle, Giorgio
2008-01-01
Background: Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment. Results: In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware. Conclusions: The results show that graphic cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the largely adopted heuristic approaches. PMID:18387198
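The recurrence all of these implementations evaluate is the Smith-Waterman dynamic-programming table; the reference form below uses a linear gap penalty for brevity, while the GPU version computes the same table with cells along each anti-diagonal processed in parallel.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Reference Smith-Waterman recurrence: score of the optimal local
    alignment between sequences a and b (linear gap penalty)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```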
Advanced techniques and technology for efficient data storage, access, and transfer
NASA Technical Reports Server (NTRS)
Rice, Robert F.; Miller, Warner
1991-01-01
Advanced techniques for efficiently representing most forms of data are being implemented in practical hardware and software form through the joint efforts of three NASA centers. These techniques adapt to local statistical variations to continually provide near-optimum code efficiency when representing data without error. Demonstrated in several earlier space applications, these techniques are the basis of initial NASA data compression standards specifications. Since the techniques clearly apply to most NASA science data, NASA invested in the development of both hardware and software implementations for general use. This investment includes high-speed single-chip very large scale integration (VLSI) coding and decoding modules as well as machine-transferable software routines. The hardware chips were tested in the laboratory at data rates as high as 700 Mbits/s. A coding module's definition includes a predictive preprocessing stage and a powerful adaptive coding stage. The function of the preprocessor is to optimally process incoming data into a standard form data source that the second stage can handle. The built-in preprocessor of the VLSI coder chips is ideal for high-speed sampled data applications such as imaging and high-quality audio, but additionally, the second-stage adaptive coder can be used separately with any source that can be externally preprocessed into the 'standard form'. This generic functionality assures that the applicability of these techniques and their recent high-speed implementations should be equally broad outside of NASA.
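One concrete piece of such a preprocessor is the interleaving that turns signed prediction residuals into the non-negative "standard form" integers the adaptive coder expects; the mapping below is the standard one, though the chips' exact convention may differ.

```python
def map_residual(delta):
    """Interleave signed residuals onto non-negative integers:
    0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * delta if delta >= 0 else -2 * delta - 1
```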
DANoC: An Efficient Algorithm and Hardware Codesign of Deep Neural Networks on Chip.
Zhou, Xichuan; Li, Shengli; Tang, Fang; Hu, Shengdong; Lin, Zhi; Zhang, Lei
2017-07-18
Deep neural networks (NNs) are the state-of-the-art models for understanding the content of images and videos. However, implementing deep NNs in embedded systems is a challenging task, e.g., a typical deep belief network could exhaust gigabytes of memory and result in bandwidth and computational bottlenecks. To address this challenge, this paper presents an algorithm and hardware codesign for efficient deep neural computation. A hardware-oriented deep learning algorithm, named the deep adaptive network, is proposed to explore the sparsity of neural connections. By adaptively removing the majority of neural connections and robustly representing the reserved connections using binary integers, the proposed algorithm can save up to 99.9% of memory utility and computational resources without undermining classification accuracy. An efficient sparse-mapping-memory-based hardware architecture is proposed to take full advantage of the algorithmic optimization. Different from traditional Von Neumann architecture, the deep-adaptive network on chip (DANoC) brings communication and computation into close proximity to avoid power-hungry parameter transfers between on-board memory and on-chip computational units. Experiments over different image classification benchmarks show that the DANoC system achieves competitively high accuracy and efficiency compared with state-of-the-art approaches.
Komorkiewicz, Mateusz; Kryjak, Tomasz; Gorgon, Marek
2014-01-01
This article presents an efficient hardware implementation of the Horn-Schunck algorithm that can be used in an embedded optical flow sensor. An architecture is proposed that realises the iterative Horn-Schunck algorithm in a pipelined manner. This modification allows a data throughput of 175 MPixels/s to be achieved and makes processing of a Full HD video stream (1,920 × 1,080 @ 60 fps) possible. The structure of the optical flow module as well as the pre- and post-filtering blocks and a flow reliability computation unit are described in detail. Three versions of the optical flow module, with different numerical precision, working frequency and result accuracy, are proposed. The errors caused by switching from floating- to fixed-point computations are also evaluated. The described architecture was tested on popular sequences from the optical flow dataset of Middlebury University. It achieves state-of-the-art results among hardware implementations of single-scale methods. The designed fixed-point architecture achieves a performance of 418 GOPS with a power efficiency of 34 GOPS/W. The proposed floating-point module achieves 103 GFLOPS, with a power efficiency of 24 GFLOPS/W. Moreover, a 100-times speedup compared to a modern CPU with SIMD support is reported. A complete, working vision system realized on a Xilinx VC707 evaluation board is also presented. It is able to compute optical flow for a Full HD video stream received from an HDMI camera in real time. The obtained results prove that FPGA devices are an ideal platform for embedded vision systems. PMID:24526303
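The iteration being pipelined is the classical Horn-Schunck update: the flow field is pulled toward its neighbourhood average and corrected along the image gradient. A NumPy sketch of one iteration (the 4-neighbour average and alpha value are textbook choices, not the paper's fixed-point parameters):

```python
import numpy as np

def horn_schunck_iter(u, v, Ix, Iy, It, alpha=10.0):
    """One Horn-Schunck update for flow (u, v) given image gradients
    Ix, Iy, It; the hardware unrolls a fixed number of these."""
    u_avg = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                    np.roll(u, 1, 1) + np.roll(u, -1, 1))
    v_avg = 0.25 * (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                    np.roll(v, 1, 1) + np.roll(v, -1, 1))
    t = (Ix * u_avg + Iy * v_avg + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
    return u_avg - Ix * t, v_avg - Iy * t
```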
The design and hardware implementation of a low-power real-time seizure detection algorithm
NASA Astrophysics Data System (ADS)
Raghunathan, Shriram; Gupta, Sumeet K.; Ward, Matthew P.; Worth, Robert M.; Roy, Kaushik; Irazoqui, Pedro P.
2009-10-01
Epilepsy affects more than 1% of the world's population. Responsive neurostimulation is emerging as an alternative therapy for the 30% of the epileptic patient population that does not benefit from pharmacological treatment. Efficient seizure detection algorithms will enable closed-loop epilepsy prostheses by stimulating the epileptogenic focus within an early onset window. Critically, this is expected to reduce neuronal desensitization over time and lead to longer-term device efficacy. This work presents a novel event-based seizure detection algorithm along with a low-power digital circuit implementation. Hippocampal depth-electrode recordings from six kainate-treated rats are used to validate the algorithm and hardware performance in this preliminary study. The design process illustrates crucial trade-offs in translating mathematical models into hardware implementations, and validates statistical optimizations made with empirical data analyses against results obtained using a real-time functioning hardware prototype. Using quantitatively predicted thresholds from the depth-electrode recordings, the auto-updating algorithm performs with an average sensitivity and selectivity of 95.3 ± 0.02% and 88.9 ± 0.01% (mean ± SE, α = 0.05), respectively, on untrained data, with a detection delay of 8.5 s [5.97, 11.04] from electrographic onset. The hardware implementation is shown to be feasible using CMOS circuits consuming under 350 nW of power from a 250 mV supply voltage, in simulations of the MIT 180 nm SOI process.
Digital CODEC for real-time processing of broadcast quality video signals at 1.8 bits/pixel
NASA Technical Reports Server (NTRS)
Shalkhauser, Mary Jo; Whyte, Wayne A., Jr.
1989-01-01
Advances in very large-scale integration and recent work in the field of bandwidth efficient digital modulation techniques have combined to make digital video processing technically feasible and potentially cost competitive for broadcast quality television transmission. A hardware implementation was developed for a DPCM-based digital television bandwidth compression algorithm which processes standard NTSC composite color television signals and produces broadcast quality video in real time at an average of 1.8 bits/pixel. The data compression algorithm and the hardware implementation of the CODEC are described, and performance results are provided.
Digital CODEC for real-time processing of broadcast quality video signals at 1.8 bits/pixel
NASA Technical Reports Server (NTRS)
Shalkhauser, Mary Jo; Whyte, Wayne A.
1991-01-01
Advances in very large scale integration and recent work in the field of bandwidth efficient digital modulation techniques have combined to make digital video processing technically feasible and potentially cost competitive for broadcast quality television transmission. A hardware implementation was developed for a DPCM (differential pulse code modulation)-based digital television bandwidth compression algorithm which processes standard NTSC composite color television signals and produces broadcast quality video in real time at an average of 1.8 bits/pixel. The data compression algorithm and the hardware implementation of the codec are described, and performance results are provided.
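Since both entries above rest on DPCM, a minimal sketch of the codec principle may help. The coarse quantizer below is an illustrative assumption, not the broadcast-quality design.

```python
def dpcm_encode(samples, quantize):
    """DPCM: transmit the quantized difference between each sample and the
    decoder's prediction (here simply the previously reconstructed sample)."""
    codes, recon = [], 0
    for x in samples:
        q = quantize(x - recon)   # quantized prediction error
        codes.append(q)
        recon += q                # track exactly what the decoder rebuilds
    return codes

def dpcm_decode(codes):
    recon, out = 0, []
    for q in codes:
        recon += q
        out.append(recon)
    return out

# Example: a uniform quantizer; step size controls the rate/quality trade-off.
step = 8
codes = dpcm_encode([100, 104, 109, 109, 90], lambda e: step * round(e / step))
print(dpcm_decode(codes))
```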
Embedded algorithms within an FPGA-based system to process nonlinear time series data
NASA Astrophysics Data System (ADS)
Jones, Jonathan D.; Pei, Jin-Song; Tull, Monte P.
2008-03-01
This paper presents some preliminary results of an ongoing project. A pattern classification algorithm is being developed and embedded into a Field-Programmable Gate Array (FPGA) and microprocessor-based data processing core in this project. The goal is to enable and optimize the functionality of onboard data processing of nonlinear, nonstationary data for smart wireless sensing in structural health monitoring. Compared with traditional microprocessor-based systems, fast-growing FPGA technology offers a more powerful, efficient, and flexible hardware platform, including on-site (field-programmable) hardware reconfiguration capability. An existing nonlinear identification algorithm is used as the baseline in this study. The implementation within a hardware-based system is presented in this paper, detailing the design requirements, validation, tradeoffs, optimization, and challenges in embedding this algorithm. An off-the-shelf high-level abstraction tool, along with the Matlab/Simulink environment, is utilized to program the FPGA rather than coding the hardware description language (HDL) manually. The implementation is validated by comparing the simulation results with those from Matlab. In particular, the Hilbert Transform is embedded into the FPGA hardware and applied to the baseline algorithm as the centerpiece for processing nonlinear time histories and extracting instantaneous features of nonstationary dynamic data. The selection of proper numerical methods for the hardware execution of the selected identification algorithm and consideration of the fixed-point representation are elaborated. Other challenges include timing in the hardware execution cycle of the design, resource consumption, approximation accuracy, and the limited flexibility of user input data types in this preliminary design. Future work includes making an FPGA and microprocessor operate together to embed a further developed algorithm that yields better computational and power efficiency.
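As a software reference for the centerpiece operation, the instantaneous features mentioned above can be computed from the analytic signal. A minimal SciPy sketch follows; it uses floating point, unlike the fixed-point FPGA version discussed in the paper.

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_features(x, fs):
    """Analytic signal via the Hilbert transform: instantaneous amplitude
    and frequency of a nonstationary time history sampled at fs Hz."""
    z = hilbert(x)                                    # x + j * H{x}
    amplitude = np.abs(z)                             # envelope
    phase = np.unwrap(np.angle(z))
    frequency = np.diff(phase) * fs / (2 * np.pi)     # Hz, length len(x)-1
    return amplitude, frequency
```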
Demonstration of a High-Efficiency Free-Space Optical Communications Link
NASA Technical Reports Server (NTRS)
Birnbaum, Kevin; Farr, William; Gin, Jonathan; Moision, Bruce; Quirk, Kevin; Wright, Malcolm
2009-01-01
In this paper we discuss recent progress on the implementation of a hardware free-space optical communications test-bed. The test-bed implements an end-to-end communications system comprising a data encoder, modulator, laser-transmitter, telescope, detector, receiver and error-correction-code decoder. Implementation of each of the component systems is discussed, with an emphasis on 'real-world' system performance degradation and limitations. We have demonstrated real-time data rates of 44 Mbps and photon efficiencies of approximately 1.8 bits/photon over a 100 m free-space optical link.
Efficient hash tables for network applications.
Zink, Thomas; Waldvogel, Marcel
2015-01-01
Hashing has yet to be widely accepted as a component of hard real-time systems and hardware implementations, due to still-existing prejudices concerning the unpredictability of space and time requirements resulting from collisions. While in theory perfect hashing can provide optimal mapping, in practice finding a perfect hash function is too expensive, especially in the context of high-speed applications. The introduction of hashing with multiple choices, d-left hashing and probabilistic table summaries has caused a shift towards deterministic DRAM access. However, large amounts of scarce and expensive high-speed SRAM must be traded for this predictability, which is infeasible for many applications. In this paper we show that previous suggestions suffer from the false precondition of full generality. Our approach exploits four individual degrees of freedom available in many practical applications, especially hardware and high-speed lookups. This reduces on-chip memory requirements by up to an order of magnitude and guarantees constant lookup and update time, at the cost of only minute amounts of additional hardware. Our design makes efficient hash table implementations cheaper, more predictable, and more practical.
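A toy Python version of hashing with multiple choices (here d = 2; d-left hashing generalizes the idea) shows where the deterministic lookup bound comes from. Bucket count and the hash construction are illustrative assumptions.

```python
import hashlib

class TwoChoiceTable:
    """Each key may live in one of two buckets; inserts go to the less
    loaded one, so a lookup probes at most two buckets, the kind of
    bounded, predictable access pattern that suits hardware pipelines."""
    def __init__(self, nbuckets=1024):
        self.buckets = [[] for _ in range(nbuckets)]

    def _slots(self, key):
        h = hashlib.sha256(key.encode()).digest()
        n = len(self.buckets)
        return (int.from_bytes(h[:4], 'big') % n,
                int.from_bytes(h[4:8], 'big') % n)

    def insert(self, key, value):
        a, b = self._slots(key)
        target = a if len(self.buckets[a]) <= len(self.buckets[b]) else b
        self.buckets[target].append((key, value))

    def lookup(self, key):
        a, b = self._slots(key)
        for i in (a, b):                      # at most two probes
            for k, v in self.buckets[i]:
                if k == key:
                    return v
        return None
```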
Adaptive controller for volumetric display of neuroimaging studies
NASA Astrophysics Data System (ADS)
Bleiberg, Ben; Senseney, Justin; Caban, Jesus
2014-03-01
Volumetric display of medical images is an increasingly relevant method for examining an imaging acquisition as the prevalence of thin-slice imaging increases in clinical studies. Current mouse and keyboard implementations for volumetric control provide neither the sensitivity nor the specificity required to manipulate a volumetric display for efficient reading in a clinical setting. Solutions for efficient volumetric manipulation gain sensitivity by removing the binary nature of actions controlled by keyboard clicks, but specificity is lost because a single action may change the display in several directions. When specificity is then further addressed by re-implementing binary hardware functions through the introduction of mode control, the result is a cumbersome interface that fails to achieve the revolutionary benefit required for adoption of a new technology. We address the specificity-versus-sensitivity problem of volumetric interfaces by providing adaptive positional awareness to the volumetric control device, manipulating communication between the hardware driver and existing software methods for volumetric display of medical images. This creates a tethered effect for volumetric display, providing a smooth interface that improves on existing hardware approaches to volumetric scene manipulation.
NASA Astrophysics Data System (ADS)
Zaveri, Mazad Shaheriar
The semiconductor/computer industry has been following Moore's law for several decades and has reaped the benefits in speed and density of the resultant scaling. Transistor density has reached almost one billion per chip, and transistor delays are in picoseconds. However, scaling has slowed down, and the semiconductor industry is now facing several challenges. Hybrid CMOS/nano technologies, such as CMOL, are considered an interim solution to some of these challenges. Another potential architectural solution includes specialized architectures for applications/models in the intelligent computing domain, one aspect of which includes abstract computational models inspired by the neuro/cognitive sciences. Consequently, in this dissertation, we focus on the hardware implementations of Bayesian Memory (BM), which is a (Bayesian) Biologically Inspired Computational Model (BICM). This model is a simplified version of George and Hawkins' model of the visual cortex, which includes an inference framework based on Judea Pearl's belief propagation. We then present a "hardware design space exploration" methodology for implementing and analyzing the (digital and mixed-signal) hardware for the BM. This particular methodology involves: analyzing the computational/operational cost and the related micro-architecture, exploring candidate hardware components, proposing various custom hardware architectures using both traditional CMOS and the hybrid nanotechnology CMOL, and investigating the baseline performance/price of these architectures. The results suggest that CMOL is a promising candidate for implementing a BM. Such implementations can utilize the very high density storage/computation benefits of these new nano-scale technologies much more efficiently; for example, the throughput per 858 mm² (TPM) obtained for CMOL-based architectures is 32 to 40 times better than the TPM for a CMOS-based multiprocessor/multi-FPGA system, and almost 2000 times better than the TPM for a PC implementation. We later use this methodology to investigate the hardware implementations of a cortex-scale spiking neural system, which is an approximate neural equivalent of a BICM-based cortex-scale system. The results of this investigation also suggest that CMOL is a promising candidate for implementing such large-scale neuromorphic systems. In general, the assessment of such hypothetical baseline hardware architectures provides the prospects for building large-scale (mammalian cortex-scale) implementations of neuromorphic/Bayesian/intelligent systems using state-of-the-art and beyond state-of-the-art silicon structures.
Compressive Sensing Based Bio-Inspired Shape Feature Detection CMOS Imager
NASA Technical Reports Server (NTRS)
Duong, Tuan A. (Inventor)
2015-01-01
A CMOS imager integrated circuit using compressive sensing and bio-inspired detection is presented that integrates novel functions and algorithms within a new hardware architecture, enabling efficient on-chip implementation.
VME rollback hardware for time warp multiprocessor systems
NASA Technical Reports Server (NTRS)
Robb, Michael J.; Buzzell, Calvin A.
1992-01-01
The purpose of the research effort is to develop and demonstrate innovative hardware to implement specific rollback and timing functions required for efficient queue management and precision timekeeping in multiprocessor discrete event simulations. The previously completed phase 1 effort demonstrated the technical feasibility of building hardware modules which eliminate the state saving overhead of the Time Warp paradigm used in distributed simulations on multiprocessor systems. The current phase 2 effort will build multiple pre-production rollback hardware modules integrated with a network of Sun workstations, and the integrated system will be tested by executing a Time Warp simulation. The rollback hardware will be designed to interface with the greatest number of multiprocessor systems possible. The authors believe that the rollback hardware will provide for significant speedup of large scale discrete event simulation problems and allow multiprocessors using Time Warp to dramatically increase performance.
Hardware architecture design of image restoration based on time-frequency domain computation
NASA Astrophysics Data System (ADS)
Wen, Bo; Zhang, Jing; Jiao, Zipeng
2013-10-01
Image restoration algorithms based on time-frequency domain computation (TFDC) are mature and widely applied in engineering. To enable high-speed implementation of these algorithms, a TFDC hardware architecture is proposed. First, the main module is designed by analyzing the common processing steps and numerical calculations. Then, to improve generality, an iteration control module is planned for iterative algorithms. In addition, to reduce computational cost and memory requirements, optimizations are suggested for the time-consuming modules, namely the two-dimensional FFT/IFFT and the complex-number arithmetic. Finally, the TFDC hardware architecture is adopted for the hardware design of a real-time image restoration system. The results prove that the TFDC hardware architecture and its optimizations can be applied to image restoration algorithms based on TFDC, with good algorithm commonality, hardware realizability and high efficiency.
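As a concrete example of a TFDC-style restoration algorithm built from exactly the modules named above (2D FFT/IFFT plus per-pixel complex arithmetic), here is a minimal Wiener deconvolution sketch. Wiener filtering is one representative TFDC algorithm, not necessarily the paper's target, and the regularization constant k is an assumed parameter.

```python
import numpy as np

def wiener_restore(blurred, psf, k=0.01):
    """Frequency-domain restoration: one 2D FFT of the image and the PSF,
    a per-pixel complex multiply/divide, and one inverse FFT."""
    H = np.fft.fft2(psf, s=blurred.shape)           # zero-padded PSF spectrum
    G = np.fft.fft2(blurred)
    F_hat = np.conj(H) / (np.abs(H) ** 2 + k) * G   # Wiener filter
    return np.real(np.fft.ifft2(F_hat))
```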
Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel
2016-10-01
The Smith-Waterman algorithm has great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, implementations in the literature exploit the different hardware architectures available in a standard PC, such as the GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementation of the Smith-Waterman algorithm on the corresponding hardware architecture, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution and up to a 2.58-fold performance gain when compared with any other algorithm for searching sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: an Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.
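The computational core being distributed across devices is the Smith-Waterman recurrence. A plain Python reference (score only, with linear rather than affine gaps, as a simplifying assumption) is sketched below; the quadratic cell count is what makes hardware offload pay off.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score via the Smith-Waterman dynamic program;
    fills O(len(a) * len(b)) cells and tracks the best local score."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # substitution
                          H[i - 1][j] + gap,     # deletion
                          H[i][j - 1] + gap)     # insertion
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))
```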
SIMD Optimization of Linear Expressions for Programmable Graphics Hardware
Bajaj, Chandrajit; Ihm, Insung; Min, Jungki; Oh, Jinsang
2009-01-01
The increased programmability of graphics hardware allows efficient graphics processing unit (GPU) implementations of a wide range of general computations on commodity PCs. An important factor in such implementations is how to fully exploit the SIMD computing capacities offered by modern graphics processors. Linear expressions in the form of ȳ = Ax̄ + b̄, where A is a matrix and x̄, ȳ and b̄ are vectors, constitute one of the most basic operations in many scientific computations. In this paper, we propose a SIMD code optimization technique that enables efficient shader codes to be generated for evaluating linear expressions. It is shown that performance can be improved considerably by efficiently packing arithmetic operations into four-wide SIMD instructions through reordering of the operations in linear expressions. We demonstrate that the presented technique can be used effectively for programming both vertex and pixel shaders for a variety of mathematical applications, including integrating differential equations and solving a sparse linear system of equations using iterative methods. PMID:19946569
Efficient Bit-to-Symbol Likelihood Mappings
NASA Technical Reports Server (NTRS)
Moision, Bruce E.; Nakashima, Michael A.
2010-01-01
This innovation is an efficient algorithm designed to perform bit-to-symbol and symbol-to-bit likelihood mappings, which represent a significant portion of the complexity of an error-correction-code decoder for high-order constellations. A recent implementation of the algorithm in hardware yielded an 8-percent reduction in overall area relative to the prior design.
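The abstract does not spell out the innovation itself, so as background, the standard max-log symbol-to-bit likelihood mapping whose cost such innovations reduce can be sketched as follows; the bit-labeling convention (bit k taken from label bit k) is an assumption.

```python
import numpy as np

def symbol_to_bit_llrs(symbol_llhs, bits_per_symbol):
    """Max-log mapping from symbol log-likelihoods to per-bit LLRs: for
    each bit position, take the best log-likelihood over symbols whose
    label has that bit equal to 0, minus the best over those with 1."""
    symbol_llhs = np.asarray(symbol_llhs)
    labels = np.arange(symbol_llhs.size)
    llrs = np.empty(bits_per_symbol)
    for k in range(bits_per_symbol):
        bit = (labels >> k) & 1
        llrs[k] = symbol_llhs[bit == 0].max() - symbol_llhs[bit == 1].max()
    return llrs
```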
Stream-based Hebbian eigenfilter for real-time neuronal spike discrimination
2012-01-01
Background: Principal component analysis (PCA) has been widely employed for automatic neuronal spike sorting. Calculating principal components (PCs) is computationally expensive and requires complex numerical operations and large memory resources. Substantial hardware resources are therefore needed for hardware implementations of PCA. The general Hebbian algorithm (GHA) was proposed for calculating PCs of neuronal spikes in our previous work, which eliminates the need for the computationally expensive covariance analysis and eigenvalue decomposition of conventional PCA algorithms. However, large memory resources are still inherently required for storing a large volume of aligned spikes for training PCs. The large memory consumes substantial hardware resources and contributes significant power dissipation, which makes GHA difficult to implement in portable or implantable multi-channel recording micro-systems. Method: In this paper, we present a new algorithm for PCA-based spike sorting based on GHA, namely the stream-based Hebbian eigenfilter, which eliminates the inherent memory requirements of GHA while keeping the accuracy of spike sorting by utilizing the pseudo-stationarity of neuronal spikes. Because of the reduction of large hardware storage requirements, the proposed algorithm can lead to ultra-low hardware resource usage and power consumption, which is critical for future multi-channel micro-systems. Both clinical and synthetic neural recording data sets were employed to evaluate the accuracy of the stream-based Hebbian eigenfilter. The performance of spike sorting using the stream-based eigenfilter and the computational complexity of the eigenfilter were rigorously evaluated and compared with conventional PCA algorithms. Field-programmable gate arrays (FPGAs) were employed to implement the proposed algorithm, evaluate the hardware implementations and demonstrate the reduction in both power consumption and hardware memory achieved by the streaming computation. Results and discussion: Results demonstrate that the stream-based eigenfilter can achieve the same accuracy and is 10 times more computationally efficient when compared with conventional PCA algorithms. Hardware evaluations show that 90.3% of logic resources, 95.1% of power consumption and 86.8% of computing latency can be reduced by the stream-based eigenfilter when compared with PCA hardware. By utilizing the streaming method, 92% of memory resources and 67% of power consumption can be saved when compared with the direct implementation of GHA. Conclusion: The stream-based Hebbian eigenfilter presents a novel approach to enable real-time spike sorting with reduced computational complexity and hardware costs. This new design can be further utilized for multi-channel neuro-physiological experiments or chronic implants. PMID:22490725
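For context, the GHA (Sanger's rule) update that a stream-based eigenfilter applies one sample at a time, removing the need to buffer a training set, is compact. A NumPy sketch with an assumed learning rate follows; it is the textbook rule, not the paper's streaming refinement.

```python
import numpy as np

def gha_step(W, x, lr=0.001):
    """One General Hebbian Algorithm (Sanger's rule) update: the rows of W
    converge to the leading principal components of the input stream,
    with no covariance matrix or eigendecomposition required."""
    y = W @ x
    # Hebbian term minus a Gram-Schmidt-like decorrelation term
    W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W
```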
Compressive Sensing Image Sensors-Hardware Implementation
Dadkhah, Mohammadreza; Deen, M. Jamal; Shirani, Shahram
2013-01-01
The compressive sensing (CS) paradigm uses simultaneous sensing and compression to provide an efficient image acquisition technique. The main advantages of the CS method include high resolution imaging using low resolution sensor arrays and faster image acquisition. Since the imaging philosophy in CS imagers is different from conventional imaging systems, new physical structures have been developed for cameras that use the CS technique. In this paper, a review of different hardware implementations of CS encoding in optical and electrical domains is presented. Considering the recent advances in CMOS (complementary metal–oxide–semiconductor) technologies and the feasibility of performing on-chip signal processing, important practical issues in the implementation of CS in CMOS sensors are emphasized. In addition, the CS coding for video capture is discussed. PMID:23584123
Hardware Implementation of a MIMO Decoder Using Matrix Factorization Based Channel Estimation
NASA Astrophysics Data System (ADS)
Islam, Mohammad Tariqul; Numan, Mostafa Wasiuddin; Misran, Norbahiah; Ali, Mohd Alauddin Mohd; Singh, Mandeep
2011-05-01
This paper presents an efficient hardware realization of a multiple-input multiple-output (MIMO) wireless communication decoder that utilizes the available resources by adopting the technique of parallelism. The hardware is designed and implemented on a Xilinx Virtex™-4 XC4VLX60 field-programmable gate array (FPGA) device in a modular approach, which simplifies and eases hardware updates and facilitates testing of the various modules independently. The decoder involves a proficient channel estimation module that employs matrix factorization on least squares (LS) estimation to reduce a full-rank matrix into a simpler form in order to eliminate matrix inversion. This results in performance improvement and complexity reduction of the MIMO system. Performance of the proposed method is validated through MATLAB simulations, which indicate a 2 dB improvement in terms of SNR compared to LS estimation. Moreover, a complexity comparison is performed in terms of mathematical operations, which shows that the proposed approach appreciably outperforms LS estimation at a lower complexity and represents a good solution for channel estimation.
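The paper's exact factorization is not reproduced here, but the inversion-free idea can be illustrated with a QR-based least-squares channel estimate, one standard choice; X (the known pilot matrix) and y (the received vector) are assumed names.

```python
import numpy as np

def ls_channel_estimate(X, y):
    """Least-squares channel estimate h minimizing ||y - X h|| without
    forming (X^H X)^-1: QR factorization reduces the full-rank LS problem
    to a back-substitution on a triangular matrix."""
    Q, R = np.linalg.qr(X)                     # X = Q R, R upper triangular
    return np.linalg.solve(R, Q.conj().T @ y)  # triangular solve, no inverse
```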
A real-time biomimetic acoustic localizing system using time-shared architecture
NASA Astrophysics Data System (ADS)
Nourzad Karl, Marianne; Karl, Christian; Hubbard, Allyn
2008-04-01
In this paper a real-time sound source localizing system is proposed, which is based on previously developed mammalian auditory models. Traditionally, following these models, which use interaural time delay (ITD) estimates, the amount of parallel computation needed by a system to achieve real-time sound source localization is a limiting factor and a design challenge for hardware implementations. Therefore, a new approach using a time-shared architecture is introduced. The proposed architecture is a purely sample-driven digital system, and it follows closely the continuous-time approach described in the models. Rather than having dedicated hardware on a per-frequency-channel basis, a specialized core channel, shared across all frequency bands, is used. With an optimized execution time much shorter than the system's sample period, the proposed time-shared solution allows the same number of virtual channels to be processed as the dedicated channels of the traditional approach. Hence, the time-shared approach achieves a highly economical and flexible implementation using minimal silicon area. These aspects are particularly important for the efficient hardware implementation of a real-time biomimetic sound source localization system.
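The per-channel computation that the time-shared core evaluates band by band reduces, in its simplest software form, to a cross-correlation ITD estimate. A minimal sketch under assumed signal names:

```python
import numpy as np

def itd_estimate(left, right, fs):
    """Estimate the interaural time delay as the lag at the peak of the
    cross-correlation between the two ear signals (sampled at fs Hz)."""
    corr = np.correlate(left, right, mode='full')
    lag = int(np.argmax(corr)) - (len(right) - 1)
    return lag / fs   # seconds; sign convention depends on channel order
```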
The dynamical analysis of modified two-compartment neuron model and FPGA implementation
NASA Astrophysics Data System (ADS)
Lin, Qianjin; Wang, Jiang; Yang, Shuangming; Yi, Guosheng; Deng, Bin; Wei, Xile; Yu, Haitao
2017-10-01
The complexity of neural models is increasing with the investigation of larger biological neural networks, more varied ionic channels and more detailed morphologies, and the implementation of a biological neural network is a task of huge computational complexity and power consumption. This paper presents an efficient digital design using piecewise linearization on a field programmable gate array (FPGA) to succinctly implement the reduced two-compartment model, which retains the essential features of more complicated models. The design proposes an approximate neuron model composed of a set of piecewise-linear equations, and it can reproduce different dynamical behaviors to depict the mechanisms of a single neuron model. The consistency of the hardware implementation is verified in terms of dynamical behaviors and bifurcation analysis, and the simulation results, including varied ion-channel characteristics, coincide with the biological neuron model with high accuracy. Hardware synthesis on FPGA demonstrates that the proposed model has reliable performance and lower hardware resource usage compared with the original two-compartment model. These investigations are conducive to the scalability of biological neural networks in reconfigurable large-scale neuromorphic systems.
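Piecewise linearization, as used here, replaces each nonlinear gating curve with segment-wise interpolation between precomputed samples, trading multipliers and exponentials for table lookups and adds. A minimal sketch; the breakpoint count and the example sigmoid are illustrative assumptions, not the paper's equations.

```python
import numpy as np

def pwl(x, breakpoints, values):
    """Piecewise-linear approximation: interpolate between precomputed
    samples of a nonlinear curve at fixed breakpoints."""
    return np.interp(x, breakpoints, values)

# Example: approximate a sigmoidal gating function with 8 linear segments.
bp = np.linspace(-60.0, 0.0, 9)              # membrane potential (mV)
vals = 1.0 / (1.0 + np.exp(-(bp + 30) / 5))  # exact curve, sampled once
m_inf_approx = pwl(np.array([-45.0, -30.0, -10.0]), bp, vals)
```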
NASA Astrophysics Data System (ADS)
Villar, Xabier; Piso, Daniel; Bruguera, Javier D.
2014-02-01
This paper presents an FPGA implementation of a previously published algorithm for the reconstruction of cosmic-ray trajectories and the determination of the time of arrival and velocity of the particles. The accuracy and precision requirements of the algorithm have been analyzed to propose a suitable implementation; thus, a 32-bit fixed-point format has been used to represent the data values. Moreover, the dependencies among the different operations have been taken into account to obtain a highly parallel and efficient hardware implementation. The final hardware architecture requires 18 cycles to process each particle and has been exhaustively simulated to validate all the design decisions. The architecture has been mapped onto different commercial FPGAs, with operating frequencies ranging from 300 MHz to 1.3 GHz depending on the FPGA used. Consequently, the number of particle trajectories processed per second is between 16 million and 72 million. This high rate shows that the proposed FPGA implementation might also be used in high-rate environments such as those found in particle and nuclear physics experiments.
An Efficient Hardware Circuit for Spike Sorting Based on Competitive Learning Networks.
Chen, Huan-Yuan; Chen, Chih-Chang; Hwang, Wen-Jyi
2017-09-28
This study aims to present an effective VLSI circuit for multi-channel spike sorting. The circuit supports the spike detection, feature extraction and classification operations. The detection circuit is implemented in accordance with the nonlinear energy operator algorithm. Both the peak detection and area computation operations are adopted for the realization of the hardware architecture for feature extraction. The resulting feature vectors are classified by a circuit for competitive learning (CL) neural networks. The CL circuit supports both online training and classification. In the proposed architecture, all the channels share the same detection, feature extraction, learning and classification circuits for a low area cost hardware implementation. The clock-gating technique is also employed for reducing the power dissipation. To evaluate the performance of the architecture, an application-specific integrated circuit (ASIC) implementation is presented. Experimental results demonstrate that the proposed circuit exhibits the advantages of a low chip area, a low power dissipation and a high classification success rate for spike sorting.
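The detection stage follows the nonlinear energy operator, whose software form is essentially a one-liner; the fixed threshold below is an assumed parameter (practical designs often derive it from a running estimate of the operator's mean).

```python
import numpy as np

def neo_detect(x, threshold):
    """Nonlinear energy operator psi[n] = x[n]^2 - x[n-1]*x[n+1], which
    emphasizes the high-frequency, high-amplitude content of spikes;
    threshold crossings of psi mark candidate detections."""
    psi = x[1:-1] ** 2 - x[:-2] * x[2:]
    return np.where(psi > threshold)[0] + 1   # sample indices into x
```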
Tri-state delta modulation system for Space Shuttle digital TV downlink
NASA Technical Reports Server (NTRS)
Udalov, S.; Huth, G. K.; Roberts, D.; Batson, B. H.
1981-01-01
Future requirements for Shuttle Orbiter downlink communication may include transmission of digital video which, in addition to black and white, may also be in either field-sequential or NTSC color format. The use of digitized video could provide picture privacy at the expense of additional onboard hardware, together with an increased bandwidth due to the digitization process. A general objective for the Space Shuttle application is to develop a digitization technique that is compatible with data rates in the 20-30 Mbps range but still provides good quality pictures. This paper describes a tri-state delta modulation/demodulation (TSDM) technique which is a good compromise between implementation complexity and performance. The unique feature of TSDM is that it provides for efficient run-length encoding of constant-intensity segments of a TV picture. Axiomatix has developed a hardware implementation of a high-speed TSDM transmitter and receiver for black-and-white TV and field-sequential color. The hardware complexity of this TSDM implementation is summarized in the paper.
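A behavioral sketch of the tri-state idea (not Axiomatix's hardware design, and with an assumed step size) shows where the run-length coding gain comes from: constant-intensity segments map to runs of the third, 'hold' state.

```python
def tsdm_encode(samples, step=4):
    """Tri-state delta modulation sketch: emit +1/-1 when the tracked
    estimate must move, and 0 (hold) when it already matches the input;
    runs of 0 over constant-intensity segments are run-length encodable."""
    est, out = 0, []
    for x in samples:
        if x > est + step / 2:
            out.append(+1); est += step
        elif x < est - step / 2:
            out.append(-1); est -= step
        else:
            out.append(0)        # constant-intensity hold state
    return out

print(tsdm_encode([0, 8, 16, 16, 16, 16, 8]))  # note the run of zeros
```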
Hardware Prototyping of Neural Network based Fetal Electrocardiogram Extraction
NASA Astrophysics Data System (ADS)
Hasan, M. A.; Reaz, M. B. I.
2012-01-01
The aim of this paper is to model the algorithm for Fetal ECG (FECG) extraction from composite abdominal ECG (AECG) using VHDL (Very High Speed Integrated Circuit Hardware Description Language) for FPGA (Field Programmable Gate Array) implementation. Artificial Neural Network that provides efficient and effective ways of separating FECG signal from composite AECG signal has been designed. The proposed method gives an accuracy of 93.7% for R-peak detection in FHR monitoring. The designed VHDL model is synthesized and fitted into Altera's Stratix II EP2S15F484C3 using the Quartus II version 8.0 Web Edition for FPGA implementation.
A Low-Complexity and High-Performance 2D Look-Up Table for LDPC Hardware Implementation
NASA Astrophysics Data System (ADS)
Chen, Jung-Chieh; Yang, Po-Hui; Lain, Jenn-Kaie; Chung, Tzu-Wen
In this paper, we propose a low-complexity, high-efficiency two-dimensional look-up table (2D LUT) for carrying out the sum-product algorithm in the decoding of low-density parity-check (LDPC) codes. Instead of employing adders for the core operation when updating check node messages, in the proposed scheme, the main term and correction factor of the core operation are successfully merged into a compact 2D LUT. Simulation results indicate that the proposed 2D LUT not only attains close-to-optimal bit error rate performance but also enjoys a low complexity advantage that is suitable for hardware implementation.
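A software model of the proposed merge might look as follows; the quantization step and table size are illustrative assumptions, not the paper's values. Since the result's sign is just the product of the input signs, the table only needs the two magnitudes.

```python
import numpy as np

def box_plus_exact(a, b):
    """Core check-node operation of the sum-product algorithm: the
    min-sum main term plus a two-argument correction factor."""
    main = np.sign(a) * np.sign(b) * min(abs(a), abs(b))
    corr = np.log1p(np.exp(-abs(a + b))) - np.log1p(np.exp(-abs(a - b)))
    return main + corr

# Merge main term and correction into one quantized 2D LUT over magnitudes.
STEP, N = 0.25, 64   # assumed quantization step and table size
LUT = np.array([[box_plus_exact(i * STEP, j * STEP) for j in range(N)]
                for i in range(N)])

def box_plus_lut(a, b):
    i = min(int(abs(a) / STEP), N - 1)
    j = min(int(abs(b) / STEP), N - 1)
    return np.sign(a) * np.sign(b) * LUT[i, j]   # one lookup, no adders
```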
Milde, Moritz B.; Blum, Hermann; Dietmüller, Alexander; Sumislawska, Dora; Conradt, Jörg; Indiveri, Giacomo; Sandamirskaya, Yulia
2017-01-01
Neuromorphic hardware emulates dynamics of biological neural networks in electronic circuits, offering an alternative to the von Neumann computing architecture that is low-power, inherently parallel, and event-driven. This hardware makes it possible to implement neural-network-based robotic controllers in an energy-efficient way with low latency, but requires solving the problem of device variability that is characteristic of analog electronic circuits. In this work, we interfaced a mixed-signal analog-digital neuromorphic processor ROLLS to a neuromorphic dynamic vision sensor (DVS) mounted on a robotic vehicle and developed an autonomous neuromorphic agent that is able to perform neurally inspired obstacle avoidance and target acquisition. We developed a neural network architecture that can cope with device variability and verified its robustness in different environmental situations, e.g., moving obstacles, a moving target, clutter, and poor light conditions. We demonstrate how this network, combined with the properties of the DVS, allows the robot to avoid obstacles using simple biologically inspired dynamics. We also show how a Dynamic Neural Field for target acquisition can be implemented in spiking neuromorphic hardware. This work demonstrates an implementation of working obstacle avoidance and target acquisition using mixed-signal analog/digital neuromorphic hardware. PMID:28747883
FPGA implementation of neuro-fuzzy system with improved PSO learning.
Karakuzu, Cihan; Karakaya, Fuat; Çavuşlu, Mehmet Ali
2016-07-01
This paper presents the first hardware implementation of a neuro-fuzzy system (NFS) with metaheuristic learning ability on a field programmable gate array (FPGA). Metaheuristic learning of all NFS parameters is accomplished using improved particle swarm optimization (iPSO). As a second novelty, a new functional approach, which requires neither memory nor multipliers, is proposed for the Gaussian membership functions of the NFS. The NFS and its iPSO-based learning are implemented on a Xilinx Virtex5 xc5vlx110-3ff1153, and the efficiency of the proposed implementation is tested on two dynamic system identification problems and a licence-plate detection problem as a practical application. Results indicate that the proposed NFS implementation and membership function approximation are as effective as the other approaches available in the literature but require fewer hardware resources. Copyright © 2016 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Yang, Shuangming; Deng, Bin; Wang, Jiang; Li, Huiyan; Liu, Chen; Fietkiewicz, Chris; Loparo, Kenneth A.
2017-01-01
Real-time estimation of the dynamical characteristics of thalamocortical (TC) cells, such as the dynamics of ion channels and membrane potentials, is useful and essential in the study of the thalamus in the Parkinsonian state. However, measuring the dynamical properties of ion channels is extremely challenging experimentally and even impossible in clinical applications. This paper presents and evaluates a real-time estimation system for thalamocortical hidden properties. For efficiency, we use a field programmable gate array (FPGA) for strictly hardware-based computation and algorithm optimization. In the proposed system, an FPGA-based unscented Kalman filter is implemented in a conductance-based TC neuron model. Since the complexity of the TC neuron model constrains a fully parallel hardware implementation, a cost-efficient model is proposed that reduces the resource cost while retaining the relevant ionic dynamics. Experimental results demonstrate the real-time capability to estimate thalamocortical hidden properties with high precision under both normal and Parkinsonian states. While it is applied here to estimate the hidden properties of the thalamus and explore the mechanism of the Parkinsonian state, the proposed method can also be useful in the dynamic-clamp technique of electrophysiological experiments, neural control engineering and brain-machine interface studies.
Hardware efficient monitoring of input/output signals
NASA Technical Reports Server (NTRS)
Driscoll, Kevin R. (Inventor); Hall, Brendan (Inventor); Paulitsch, Michael (Inventor)
2012-01-01
A communication device comprises first and second circuits to implement a plurality of ports via which the communication device is operable to communicate over a plurality of communication channels. For each of the plurality of ports, the communication device comprises: command hardware that includes a first transmitter to transmit data over a respective one of the plurality of channels and a first receiver to receive data from the respective one of the plurality of channels; and monitor hardware that includes a second receiver coupled to the first transmitter and a third receiver coupled to the respective one of the plurality of channels. The first circuit comprises the command hardware for a first subset of the plurality of ports. The second circuit comprises the monitor hardware for the first subset of the plurality of ports and the command hardware for a second subset of the plurality of ports.
Performance/price estimates for cortex-scale hardware: a design space exploration.
Zaveri, Mazad S; Hammerstrom, Dan
2011-04-01
In this paper, we revisit the concept of virtualization. Virtualization is useful for understanding and investigating the performance/price and other trade-offs related to the hardware design space. Moreover, it is perhaps the most important aspect of a hardware design space exploration. Such a design space exploration is a necessary part of the study of hardware architectures for large-scale computational models for intelligent computing, including AI, Bayesian, bio-inspired and neural models. A methodical exploration is needed to identify potentially interesting regions in the design space, and to assess the relative performance/price points of these implementations. As an example, in this paper we investigate the performance/price of (digital and mixed-signal) CMOS and hypothetical CMOL (nanogrid) technology based hardware implementations of human cortex-scale spiking neural systems. Through this analysis, and the resulting performance/price points, we demonstrate, in general, the importance of virtualization, and of doing these kinds of design space explorations. The specific results suggest that hybrid nanotechnology such as CMOL is a promising candidate to implement very large-scale spiking neural systems, providing a more efficient utilization of the density and storage benefits of emerging nano-scale technologies. In general, we believe that the study of such hypothetical designs/architectures will guide the neuromorphic hardware community towards building large-scale systems, and help guide research trends in intelligent computing, and computer engineering. Copyright © 2010 Elsevier Ltd. All rights reserved.
Efficient Phase Unwrapping Architecture for Digital Holographic Microscopy
Hwang, Wen-Jyi; Cheng, Shih-Chang; Cheng, Chau-Jern
2011-01-01
This paper presents a novel phase unwrapping architecture for accelerating the computational speed of digital holographic microscopy (DHM). A fast Fourier transform (FFT) based phase unwrapping algorithm providing a minimum squared error solution is adopted for hardware implementation because of its simplicity and robustness to noise. The proposed architecture is realized in a pipeline fashion to maximize throughput of the computation. Moreover, the number of hardware multipliers and dividers are minimized to reduce the hardware costs. The proposed architecture is used as a custom user logic in a system on programmable chip (SOPC) for physical performance measurement. Experimental results reveal that the proposed architecture is effective for expediting the computational speed while consuming low hardware resources for designing an embedded DHM system. PMID:22163688
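The adopted minimum-squared-error unwrapping reduces to a Poisson solve on the divergence of the wrapped phase gradients, performed with fast transforms. A compact software model follows, using discrete cosine transforms, one standard realization of the FFT-family approach; it is a reference sketch, not the pipelined hardware design.

```python
import numpy as np
from scipy.fft import dctn, idctn

def unwrap_lsq(psi):
    """Unweighted least-squares phase unwrapping: build the divergence of
    the wrapped phase gradients, then solve the resulting Poisson equation
    with fast cosine transforms (Neumann boundaries)."""
    M, N = psi.shape
    wrap = lambda d: (d + np.pi) % (2 * np.pi) - np.pi
    dx = wrap(np.diff(psi, axis=1))           # wrapped gradients, (M, N-1)
    dy = wrap(np.diff(psi, axis=0))           # (M-1, N)
    rho = np.zeros((M, N))
    rho[:, :-1] += dx; rho[:, 1:] -= dx       # d/dx of dx
    rho[:-1, :] += dy; rho[1:, :] -= dy       # d/dy of dy
    D = dctn(rho, norm='ortho')
    i = np.arange(M)[:, None]; j = np.arange(N)[None, :]
    denom = 2 * (np.cos(np.pi * i / M) + np.cos(np.pi * j / N) - 2)
    denom[0, 0] = 1.0                         # phase offset is arbitrary
    return idctn(D / denom, norm='ortho')
```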
A domain specific language for performance portable molecular dynamics algorithms
NASA Astrophysics Data System (ADS)
Saunders, William Robert; Grant, James; Müller, Eike Hermann
2018-03-01
Developers of Molecular Dynamics (MD) codes face significant challenges when adapting existing simulation packages to new hardware. In a continuously diversifying hardware landscape it becomes increasingly difficult for scientists to be experts both in their own domain (physics/chemistry/biology) and specialists in the low level parallelisation and optimisation of their codes. To address this challenge, we describe a "Separation of Concerns" approach for the development of parallel and optimised MD codes: the science specialist writes code at a high abstraction level in a domain specific language (DSL), which is then translated into efficient computer code by a scientific programmer. In a related context, an abstraction for the solution of partial differential equations with grid based methods has recently been implemented in the (Py)OP2 library. Inspired by this approach, we develop a Python code generation system for molecular dynamics simulations on different parallel architectures, including massively parallel distributed memory systems and GPUs. We demonstrate the efficiency of the auto-generated code by studying its performance and scalability on different hardware and compare it to other state-of-the-art simulation packages. With growing data volumes the extraction of physically meaningful information from the simulation becomes increasingly challenging and requires equally efficient implementations. A particular advantage of our approach is the easy expression of such analysis algorithms. We consider two popular methods for deducing the crystalline structure of a material from the local environment of each atom, show how they can be expressed in our abstraction and implement them in the code generation framework.
Pulseq: A rapid and hardware-independent pulse sequence prototyping framework.
Layton, Kelvin J; Kroboth, Stefan; Jia, Feng; Littin, Sebastian; Yu, Huijun; Leupold, Jochen; Nielsen, Jon-Fredrik; Stöcker, Tony; Zaitsev, Maxim
2017-04-01
Implementing new magnetic resonance experiments, or sequences, often involves extensive programming on vendor-specific platforms, which can be time consuming and costly. This situation is exacerbated when research sequences need to be implemented on several platforms simultaneously, for example, at different field strengths. This work presents an alternative programming environment that is hardware-independent, open-source, and promotes rapid sequence prototyping. A novel file format is described to efficiently store the hardware events and timing information required for an MR pulse sequence. Platform-dependent interpreter modules convert the file to appropriate instructions to run the sequence on MR hardware. Sequences can be designed in high-level languages, such as MATLAB, or with a graphical interface. Spin physics simulation tools are incorporated into the framework, allowing for comparison between real and virtual experiments. Minimal effort is required to implement relatively advanced sequences using the tools provided. Sequences are executed on three different MR platforms, demonstrating the flexibility of the approach. A high-level, flexible and hardware-independent approach to sequence programming is ideal for the rapid development of new sequences. The framework is currently not suitable for large patient studies or routine scanning, although this would be possible with deeper integration into existing workflows. Magn Reson Med 77:1544-1552, 2017. © 2016 International Society for Magnetic Resonance in Medicine.
A triple-mode hexa-standard reconfigurable TI cross-coupled ΣΔ modulator
NASA Astrophysics Data System (ADS)
Prakash A. V, Jos; Jose, Babita R.; Mathew, Jimson; Jose, Bijoy A.
2017-07-01
Hardware reconfigurability is an attractive solution for modern multi-standard wireless systems. This paper analyses the performance and implementation of an efficient triple-mode hexa-standard reconfigurable sigma-delta (∑Δ) modulator designed for six different wireless communication standards. Enhanced noise-shaping characteristics and an increased digitisation rate, obtained by time-interleaved cross-coupling of ∑Δ paths, have been utilised for the modulator design. Power/hardware efficiency and the capability to accommodate the requirements of the wide hexa-standard specifications are achieved by introducing an advanced noise-shaping structure, the dual-extended architecture. Simulation results of the proposed architecture using Hspice show that the proposed modulator obtains a peak signal-to-noise ratio of 83.4/80.2/67.8/61.5/60.8/51.03 dB for the hexa-standards, i.e. GSM/Bluetooth/GPS/WCDMA/WLAN/WiMAX, with significantly less hardware and a low operating frequency. The proposed architecture is implemented in a 45 nm CMOS process using a 1 V supply and 0.7 V input range, with a power consumption of 1.93 mW. Both architectural- and transistor-level simulation results prove the effectiveness and feasibility of this architecture for multi-standard cellular communication.
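For readers new to ∑Δ conversion, a first-order behavioral model (far simpler than the time-interleaved cross-coupled architecture above, and purely illustrative) captures the noise-shaping principle: the 1-bit output stream's average tracks the input while quantization noise is pushed to high frequencies.

```python
import numpy as np

def first_order_sdm(x):
    """First-order sigma-delta modulator sketch: integrate the error
    between the input and the fed-back 1-bit output."""
    integ, bits = 0.0, []
    for sample in x:                      # samples assumed in [-1, 1]
        bit = 1.0 if integ >= 0 else -1.0
        bits.append(bit)
        integ += sample - bit             # accumulate residual error
    return np.array(bits)

print(first_order_sdm(0.25 * np.ones(1000)).mean())   # ~0.25
```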
High-Density Liquid-State Machine Circuitry for Time-Series Forecasting.
Rosselló, Josep L; Alomar, Miquel L; Morro, Antoni; Oliver, Antoni; Canals, Vincent
2016-08-01
Spiking neural networks (SNN) are the latest generation of neural networks, which try to mimic the real behavior of biological neurons. Although most research in this area is done through software applications, it is in hardware implementations that the intrinsic parallelism of these computing systems is more efficiently exploited. Liquid state machines (LSM) have arisen as a strategic technique for implementing recurrent SNN designs with a simple learning methodology. In this work, we show a new low-cost methodology to implement high-density LSMs by using Boolean gates. The proposed method is based on probabilistic computing concepts to reduce hardware requirements, thus considerably increasing the neuron count per chip. The result is a highly functional system that is applied to high-speed time-series forecasting.
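The probabilistic computing trick that lets Boolean gates stand in for arithmetic is easy to demonstrate: encode values as random bit-streams whose density of ones equals the value, and a single AND gate multiplies them. A sketch with an assumed stream length:

```python
import numpy as np

rng = np.random.default_rng(0)

def to_stream(p, n=10000):
    """Encode a probability p in [0, 1] as a random bit-stream whose
    density of ones is p."""
    return rng.random(n) < p

a, b = 0.8, 0.5
product_stream = to_stream(a) & to_stream(b)   # one AND gate per bit
print(product_stream.mean())                   # ~0.4 = a * b
```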
Low-level rf control of Spallation Neutron Source: System and characterization
NASA Astrophysics Data System (ADS)
Ma, Hengjie; Champion, Mark; Crofford, Mark; Kasemir, Kay-Uwe; Piller, Maurice; Doolittle, Lawrence; Ratti, Alex
2006-03-01
The low-level rf control system currently commissioned throughout the Spallation Neutron Source (SNS) LINAC evolved from three design iterations over one year of intensive research and development. Its digital hardware implementation is efficient and has achieved a minimum latency of less than 150 ns, which is key to accomplishing all-digital feedback control over the full bandwidth. The control bandwidth is analyzed in the frequency domain and characterized by testing its transient response. The hardware implementation also includes a time-shared input channel for superior differential phase measurement between the cavity field and the reference. A companion cosimulation system for the digital hardware was developed to ensure reliable long-term supportability. A large effort has also been made in operational software development for practical issues such as process automation, cavity filling, beam-loading compensation, and cavity mechanical resonance suppression.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pais Pitta de Lacerda Ruivo, Tiago; Bernabeu Altayo, Gerard; Garzoglio, Gabriele
2014-11-11
It has been widely accepted that software virtualization has a big negative impact on high-performance computing (HPC) application performance. This work explores the potential use of InfiniBand hardware virtualization in an OpenNebula cloud towards the efficient support of MPI-based workloads. We have implemented, deployed, and tested an InfiniBand network on the FermiCloud private Infrastructure-as-a-Service (IaaS) cloud. To avoid software virtualization and minimize the virtualization overhead, we employed a technique called Single Root Input/Output Virtualization (SR-IOV). Our solution spanned modifications to the Linux hypervisor as well as the OpenNebula manager. We evaluated the performance of the hardware virtualization on up to 56 virtual machines connected by up to 8 DDR InfiniBand network links, with micro-benchmarks (latency and bandwidth) as well as with an MPI-intensive application (the HPL Linpack benchmark).
Improving the energy efficiency of sparse linear system solvers on multicore and manycore systems.
Anzt, H; Quintana-Ortí, E S
2014-06-28
While most recent breakthroughs in scientific research rely on complex simulations carried out in large-scale supercomputers, the power draw and energy spent for this purpose are increasingly becoming a limiting factor to this trend. In this paper, we provide an overview of the current status of energy-efficient scientific computing by reviewing different technologies used to monitor power draw as well as power- and energy-saving mechanisms available in commodity hardware. For the particular domain of sparse linear algebra, we analyse the energy efficiency of a broad collection of hardware architectures and investigate how algorithmic and implementation modifications can improve the energy performance of sparse linear system solvers without negatively impacting their performance. © 2014 The Author(s) Published by the Royal Society. All rights reserved.
Design and implementation of H.264 based embedded video coding technology
NASA Astrophysics Data System (ADS)
Mao, Jian; Liu, Jinming; Zhang, Jiemin
2016-03-01
In this paper, an embedded system for remote online video monitoring was designed and developed to capture and record the real-time circumstances in an elevator. To improve the efficiency of video acquisition and processing, the system uses the Samsung S5PV210 chip, which integrates a graphics processing unit, as its core processor. The video is encoded in H.264 format for efficient storage and transmission. Based on the S5PV210 chip, hardware video coding technology was researched, which is more efficient than software coding. Tests showed that hardware video coding clearly reduces system cost and produces smoother video display. It can be widely applied to security supervision [1].
Hardware implementation of CMAC neural network with reduced storage requirement.
Ker, J S; Kuo, Y H; Wen, R C; Liu, B D
1997-01-01
The cerebellar model articulation controller (CMAC) neural network has the advantages of fast convergence speed and low computational complexity. However, it suffers from a low storage-space utilization rate in weight memory. In this paper, we propose a direct weight address mapping approach, which can reduce the required weight memory size with a utilization rate near 100%. Based on this address mapping approach, we developed a pipeline architecture to efficiently perform the addressing operations. The proposed direct weight address mapping approach also speeds up the generation of weight addresses. In addition, a CMAC hardware prototype used for color calibration has been implemented to confirm the proposed approach and architecture.
Families of FPGA-Based Accelerators for Approximate String Matching
Van Court, Tom; Herbordt, Martin C.
2011-01-01
Dynamic programming for approximate string matching is a large family of different algorithms, which vary significantly in purpose, complexity, and hardware utilization. Many implementations have reported impressive speed-ups, but have typically been point solutions – highly specialized and addressing only one or a few of the many possible options. The problem to be solved is creating a hardware description that implements a broad range of behavioral options without losing efficiency due to feature bloat. We report a set of three component types that address different parts of the approximate string matching problem. This allows each application to choose the feature set required, then make maximum use of the FPGA fabric according to that application’s specific resource requirements. Multiple, interchangeable implementations are available for each component type. We show that these methods allow the efficient generation of a large, if not complete, family of accelerators for this application. This flexibility was obtained while retaining high performance: We have evaluated a sample against serial reference codes and found speed-ups of from 150× to 400× over a high-end PC. PMID:21603598
Exploiting the chaotic behaviour of atmospheric models with reconfigurable architectures
NASA Astrophysics Data System (ADS)
Russell, Francis P.; Düben, Peter D.; Niu, Xinyu; Luk, Wayne; Palmer, T. N.
2017-12-01
Reconfigurable architectures are becoming mainstream: Amazon, Microsoft and IBM are supporting such architectures in their data centres. The computationally intensive nature of atmospheric modelling is an attractive target for hardware acceleration using reconfigurable computing. Performance of hardware designs can be improved through the use of reduced-precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced-precision optimisation for simulating chaotic systems, targeting atmospheric modelling, in which even minor changes in arithmetic behaviour will cause simulations to diverge quickly. The possibility of equally valid simulations having differing outcomes means that standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide reconfigurable designs of a chaotic system, then analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, the throughput and energy efficiency of a single-precision chaotic system implemented on a Xilinx Virtex-6 SX475T Field Programmable Gate Array (FPGA) can be more than doubled.
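As a rough illustration of the comparison methodology (not the authors' atmospheric model), the sketch below computes the Hellinger distance between output histograms of a double-precision and a reduced-precision run of a simple chaotic map; the logistic map and the bin count are stand-ins chosen here for illustration.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def trajectory(x0, n, dtype):
    """Iterate the logistic map at a given arithmetic precision."""
    x, out = dtype(x0), np.empty(n)
    for i in range(n):
        x = dtype(3.99) * x * (dtype(1.0) - x)
        out[i] = x
    return out

# Individual chaotic trajectories diverge quickly, so compare statistics instead.
bins = np.linspace(0.0, 1.0, 65)
h64, _ = np.histogram(trajectory(0.2, 100_000, np.float64), bins=bins)
h32, _ = np.histogram(trajectory(0.2, 100_000, np.float32), bins=bins)
print(f"Hellinger distance (float64 vs float32): {hellinger(h64, h32):.4f}")
```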
Battery voltage-balancing applications of disk-type radial mode Pb(Zr,Ti)O3 ceramic resonator
NASA Astrophysics Data System (ADS)
Thenathayalan, Daniel; Lee, Chun-gu; Park, Joung-hu
2017-10-01
In this paper, we propose a novel technique to build a charge-balancing circuit for series-connected battery strings using various kinds of disk-type ceramic Pb(Zr,Ti)O3 piezoelectric resonators (PRs). The use of PRs replaces the whole external battery voltage-balancer circuit, which consists mainly of a bulky magnetic element. The proposed technique is validated using different ceramic PRs and the results are analyzed in terms of their physical properties. A series-connected battery string with a voltage rating of 61.5 V is set as a hardware prototype under test, then the power transfer efficiency of the system is measured at different imbalance voltages. The performance of the proposed battery voltage-balancer circuit employed with a PR is also validated through hardware implementation. Furthermore, the temperature distribution image of the PR is obtained to compare power transfer efficiency and thermal stress under different operating conditions. The test results show that the battery voltage-balancer circuit can be successfully implemented using PRs with a maximum power conversion efficiency of over 96% for energy storage systems.
Demonstration of a small programmable quantum computer with atomic qubits.
Debnath, S; Linke, N M; Figgatt, C; Landsman, K A; Wright, K; Monroe, C
2016-08-04
Quantum computers can solve certain problems more efficiently than any possible conventional computer. Small quantum algorithms have been demonstrated on multiple quantum computing platforms, many specifically tailored in hardware to implement a particular algorithm or execute a limited number of computational paths. Here we demonstrate a five-qubit trapped-ion quantum computer that can be programmed in software to implement arbitrary quantum algorithms by executing any sequence of universal quantum logic gates. We compile algorithms into a fully connected set of gate operations that are native to the hardware and have a mean fidelity of 98 per cent. Reconfiguring these gate sequences provides the flexibility to implement a variety of algorithms without altering the hardware. As examples, we implement the Deutsch-Jozsa and Bernstein-Vazirani algorithms with average success rates of 95 and 90 per cent, respectively. We also perform a coherent quantum Fourier transform on five trapped-ion qubits for phase estimation and period finding with average fidelities of 62 and 84 per cent, respectively. This small quantum computer can be scaled to larger numbers of qubits within a single register, and can be further expanded by connecting several such modules through ion shuttling or photonic quantum channels.
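The Bernstein-Vazirani logic the authors implement can be illustrated with a small statevector simulation; the sketch below (plain Python/NumPy, not the trapped-ion gate decomposition) shows how a single oracle query suffices to recover a hidden bitstring.

```python
import numpy as np
from scipy.linalg import hadamard

def bernstein_vazirani(s):
    """Recover the hidden bitstring s with a single oracle query."""
    n = len(s)
    N = 2 ** n
    Hn = hadamard(N) / np.sqrt(N)          # Hadamard applied to every qubit
    state = Hn @ np.eye(N)[0]              # uniform superposition from |0...0>
    s_int = int("".join(map(str, s)), 2)
    # Oracle via phase kickback: |x> -> (-1)^(s.x) |x>
    phases = np.array([1 - 2 * (bin(x & s_int).count("1") % 2) for x in range(N)])
    state = Hn @ (phases * state)          # final Hadamards concentrate on |s>
    return [int(b) for b in format(int(np.argmax(np.abs(state))), f"0{n}b")]

print(bernstein_vazirani([1, 0, 1, 1, 0]))  # -> [1, 0, 1, 1, 0]
```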
NASA Astrophysics Data System (ADS)
Qiu, Mo; Yu, Simin; Wen, Yuqiong; Lü, Jinhu; He, Jianbin; Lin, Zhuosheng
In this paper, a novel design methodology and its FPGA hardware implementation for a universal chaotic signal generator are proposed, based on a Verilog HDL fixed-point algorithm and state machine control. According to the continuous-time or discrete-time chaotic equations, a Verilog HDL fixed-point algorithm and its corresponding digital system are first designed. On the FPGA hardware platform, each operation step of the Verilog HDL fixed-point algorithm is then controlled by a state machine. The generality of this method lies in the fact that any given chaotic equation can be decomposed into four basic operation procedures, i.e. nonlinear function calculation, iterative sequence operation, right shifting and ceiling of iterative values, and output of the chaotic iterative sequences, each of which corresponds to a single state under state machine control. Compared with a Verilog HDL floating-point algorithm, the fixed-point algorithm saves FPGA hardware resources and improves operation efficiency. FPGA-based hardware experimental results validate the feasibility and reliability of the proposed approach.
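A minimal sketch of the fixed-point idea, in Python rather than Verilog HDL: values are held as scaled integers, and every multiplication is followed by a right shift to renormalise, exactly the kind of step a state machine would sequence. The Q-format width and the logistic map used here are illustrative assumptions, not the paper's choices.

```python
F = 24  # fractional bits (assumed Q-format; an HDL design would size widths exactly)

def to_fix(v):
    """Real value -> scaled integer."""
    return int(round(v * (1 << F)))

def fix_mul(a, b):
    """Multiply, then right-shift to renormalise: one state machine step."""
    return (a * b) >> F

# Logistic map as a stand-in chaotic equation: x <- r * x * (1 - x)
ONE = 1 << F
x, r = to_fix(0.3), to_fix(3.99)
for _ in range(5):
    x = fix_mul(r, fix_mul(x, ONE - x))   # nonlinear function calculation
    print(x / ONE)                        # chaotic iterative sequence output
```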
NASA Astrophysics Data System (ADS)
García, Aday; Santos, Lucana; López, Sebastián.; Callicó, Gustavo M.; Lopez, Jose F.; Sarmiento, Roberto
2014-05-01
Efficient onboard satellite hyperspectral image compression represents a necessity and a challenge for current and future space missions. Therefore, it is mandatory to provide hardware implementations for this type of algorithm in order to meet the constraints required for onboard compression. In this work, we implement the Lossy Compression for Exomars (LCE) algorithm on an FPGA by means of high-level synthesis (HLS) in order to shorten the design cycle. Specifically, we use the CatapultC HLS tool to obtain a VHDL description of the LCE algorithm from C-language specifications. Two different approaches are followed for HLS: on one hand, introducing the whole C-language description in CatapultC; on the other hand, splitting the C-language description into functional modules to be implemented independently with CatapultC, connecting and controlling them by an RTL description code without HLS. In both cases the goal is to obtain an FPGA implementation. We explain the several changes applied to the original C-language source code in order to optimize the results obtained by CatapultC for both approaches. Experimental results show low area occupancy of less than 15% for a SRAM-based Virtex-5 FPGA and a maximum frequency above 80 MHz. Additionally, the LCE compressor was implemented on an RTAX2000S antifuse-based FPGA, showing an area occupancy of 75% and a frequency around 53 MHz. All these serve to demonstrate that the LCE algorithm can be efficiently executed on an FPGA onboard a satellite. A comparison between both implementation approaches is also provided. The performance of the algorithm is finally compared with implementations on other technologies, specifically a graphics processing unit (GPU) and a single-threaded CPU.
FPGA implementation of a configurable neuromorphic CPG-based locomotion controller.
Barron-Zambrano, Jose Hugo; Torres-Huitzil, Cesar
2013-09-01
Neuromorphic engineering is a discipline devoted to the design and development of computational hardware that mimics the characteristics and capabilities of neuro-biological systems. In recent years, neuromorphic hardware systems have been implemented using a hybrid approach incorporating digital hardware so as to provide flexibility and scalability at the cost of power efficiency and some biological realism. This paper proposes an FPGA-based neuromorphic-like embedded system on a chip to generate locomotion patterns of periodic rhythmic movements inspired by Central Pattern Generators (CPGs). The proposed implementation follows a top-down approach where modularity and hierarchy are two desirable features. The locomotion controller is based on CPG models to produce rhythmic locomotion patterns or gaits for legged robots such as quadrupeds and hexapods. The architecture is configurable and scalable for robots with either different morphologies or different degrees of freedom (DOFs). Experiments performed on a real robot are presented and discussed. The obtained results demonstrate that the CPG-based controller provides the necessary flexibility to generate different rhythmic patterns at run-time suitable for adaptable locomotion. Copyright © 2013 Elsevier Ltd. All rights reserved.
Neuromorphic Hardware Architecture Using the Neural Engineering Framework for Pattern Recognition.
Wang, Runchun; Thakur, Chetan Singh; Cohen, Gregory; Hamilton, Tara Julia; Tapson, Jonathan; van Schaik, Andre
2017-06-01
We present a hardware architecture that uses the neural engineering framework (NEF) to implement large-scale neural networks on field programmable gate arrays (FPGAs) for performing massively parallel real-time pattern recognition. NEF is a framework that is capable of synthesising large-scale cognitive systems from subnetworks and we have previously presented an FPGA implementation of the NEF that successfully performs nonlinear mathematical computations. That work was developed based on a compact digital neural core, which consists of 64 neurons that are instantiated by a single physical neuron using a time-multiplexing approach. We have now scaled this approach up to build a pattern recognition system by combining identical neural cores together. As a proof of concept, we have developed a handwritten digit recognition system using the MNIST database and achieved a recognition rate of 96.55%. The system is implemented on a state-of-the-art FPGA and can process 5.12 million digits per second. The architecture and hardware optimisations presented offer high-speed and resource-efficient means for performing high-speed, neuromorphic, and massively parallel pattern recognition and classification tasks.
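A toy version of the NEF encode/decode principle can be written in a few lines; the sketch below uses rectified-linear rate neurons instead of the spiking neurons of the FPGA cores, and all sizes and parameter ranges are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                    # neurons in one core (illustrative)
x = np.linspace(-1, 1, 101)                # scalar value to represent
enc = rng.choice([-1.0, 1.0], n)           # random encoders
gain = rng.uniform(0.5, 2.0, n)
bias = rng.uniform(-1.0, 1.0, n)
# Rectified-linear tuning curves (the hardware uses spiking neurons)
rates = np.maximum(gain[:, None] * enc[:, None] * x[None, :] + bias[:, None], 0.0)
# Least-squares linear decoders reconstruct the represented value
d, *_ = np.linalg.lstsq(rates.T, x, rcond=None)
print(f"max reconstruction error: {np.max(np.abs(rates.T @ d - x)):.4f}")
```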
Resource Efficient Hardware Architecture for Fast Computation of Running Max/Min Filters
Torres-Huitzil, Cesar
2013-01-01
Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires k² − 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable for the kernel size with good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID:24288456
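A minimal Python model of the van Herk/Gil-Werman idea (software only, not the proposed architecture): block-wise forward and backward cumulative maxima give each window's maximum from just two array lookups, independent of the kernel size k.

```python
import numpy as np

def running_max_hgw(x, k):
    """van Herk/Gil-Werman 1D running max: O(1) comparisons per sample,
    independent of k. Window i covers x[i : i + k]."""
    n = len(x)
    pad = (-n) % k                                   # pad to a block multiple
    xp = np.concatenate([x, np.full(pad, -np.inf)])
    blocks = xp.reshape(-1, k)
    g = np.maximum.accumulate(blocks, axis=1).ravel()                    # forward
    h = np.maximum.accumulate(blocks[:, ::-1], axis=1)[:, ::-1].ravel()  # backward
    return np.maximum(h[: n - k + 1], g[k - 1 : n])  # two lookups per window

x = np.random.default_rng(7).random(1000)
k = 31
ref = np.array([x[i : i + k].max() for i in range(len(x) - k + 1)])
assert np.allclose(running_max_hgw(x, k), ref)
print("HGW running max verified")
```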
Carrillo, Snaider; Harkin, Jim; McDaid, Liam; Pande, Sandeep; Cawley, Seamus; McGinley, Brian; Morgan, Fearghal
2012-09-01
The brain is highly efficient in how it processes information and tolerates faults. Arguably, the basic processing units are neurons and synapses that are interconnected in a complex pattern. Computer scientists and engineers aim to harness this efficiency and build artificial neural systems that can emulate the key information processing principles of the brain. However, existing approaches cannot provide the dense interconnect for the billions of neurons and synapses that are required. Recently a reconfigurable and biologically inspired paradigm based on network-on-chip (NoC) and spiking neural networks (SNNs) has been proposed as a new method of realising an efficient, robust computing platform. However, the use of the NoC as an interconnection fabric for large-scale SNNs demands a good trade-off between scalability, throughput, neuron/synapse ratio and power consumption. This paper presents a novel traffic-aware, adaptive NoC router, which forms part of a proposed embedded mixed-signal SNN architecture called EMBRACE (EMulating Biologically-inspiRed ArChitectures in hardwarE). The proposed adaptive NoC router provides the inter-neuron connectivity for EMBRACE, maintaining router communication and avoiding dropped router packets by adapting to router traffic congestion. Results are presented on throughput, power and area performance analysis of the adaptive router using a 90 nm CMOS technology which outperforms existing NoCs in this domain. The adaptive behaviour of the router is also verified on a Stratix II FPGA implementation of a 4 × 2 router array with real-time traffic congestion. The presented results demonstrate the feasibility of using the proposed adaptive NoC router within the EMBRACE architecture to realise large-scale SNNs on embedded hardware. Copyright © 2012 Elsevier Ltd. All rights reserved.
Diamond, Alan; Nowotny, Thomas; Schmuker, Michael
2016-01-01
Neuromorphic computing employs models of neuronal circuits to solve computing problems. Neuromorphic hardware systems are now becoming more widely available and “neuromorphic algorithms” are being developed. As they are maturing toward deployment in general research environments, it becomes important to assess and compare them in the context of the applications they are meant to solve. This should encompass not just task performance, but also ease of implementation, speed of processing, scalability, and power efficiency. Here, we report our practical experience of implementing a bio-inspired, spiking network for multivariate classification on three different platforms: the hybrid digital/analog Spikey system, the digital spike-based SpiNNaker system, and GeNN, a meta-compiler for parallel GPU hardware. We assess performance using a standard hand-written digit classification task. We found that whilst a different implementation approach was required for each platform, classification performances remained in line. This suggests that all three implementations were able to exercise the model's ability to solve the task rather than exposing inherent platform limits, although differences emerged when capacity was approached. With respect to execution speed and power consumption, we found that for each platform a large fraction of the computing time was spent outside of the neuromorphic device, on the host machine. Time was spent in a range of combinations of preparing the model, encoding suitable input spiking data, shifting data, and decoding spike-encoded results. This is also where a large proportion of the total power was consumed, most markedly for the SpiNNaker and Spikey systems. We conclude that the simulation efficiency advantage of the assessed specialized hardware systems is easily lost in excessive host-device communication, or non-neuronal parts of the computation. These results emphasize the need to optimize the host-device communication architecture for scalability, maximum throughput, and minimum latency. Moreover, our results indicate that special attention should be paid to minimize host-device communication when designing and implementing networks for efficient neuromorphic computing. PMID:26778950
Batched matrix computations on hardware accelerators based on GPUs
Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; ...
2015-02-09
Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented, and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This study, consequently, describes the development of the most common one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereupon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications' context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library. Finally, on a test system featuring two sockets of Intel Sandy Bridge CPUs, we compared against the batched LU factorization featured in the cuBLAS library for GPUs and achieve as high as 2.5× speedup on the NVIDIA K40 GPU.
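The batched idea can be sketched in NumPy: one unblocked Cholesky factorization whose every statement operates on the entire batch at once, which is roughly how a batched kernel maps one small matrix per thread block. This is an illustration, not the MAGMA or cuBLAS code.

```python
import numpy as np

rng = np.random.default_rng(6)
batch, n = 1000, 8                                   # many small problems
A = rng.standard_normal((batch, n, n))
A = A @ A.transpose(0, 2, 1) + n * np.eye(n)         # make them SPD

# Unblocked Cholesky where every statement touches the whole batch at once
L = np.zeros_like(A)
for j in range(n):
    s = A[:, j, j] - np.einsum("bk,bk->b", L[:, j, :j], L[:, j, :j])
    L[:, j, j] = np.sqrt(s)
    for i in range(j + 1, n):
        r = A[:, i, j] - np.einsum("bk,bk->b", L[:, i, :j], L[:, j, :j])
        L[:, i, j] = r / L[:, j, j]

assert np.allclose(np.linalg.cholesky(A), L)
print("factorized", batch, "matrices of order", n)
```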
VLSI implementation of RSA encryption system using ancient Indian Vedic mathematics
NASA Astrophysics Data System (ADS)
Thapliyal, Himanshu; Srinivas, M. B.
2005-06-01
This paper proposes the hardware implementation of RSA encryption/decryption algorithm using the algorithms of Ancient Indian Vedic Mathematics that have been modified to improve performance. The recently proposed hierarchical overlay multiplier architecture is used in the RSA circuitry for multiplication operation. The most significant aspect of the paper is the development of a division architecture based on Straight Division algorithm of Ancient Indian Vedic Mathematics and embedding it in RSA encryption/decryption circuitry for improved efficiency. The coding is done in Verilog HDL and the FPGA synthesis is done using Xilinx Spartan library. The results show that RSA circuitry implemented using Vedic division and multiplication is efficient in terms of area/speed compared to its implementation using conventional multiplication and division architectures.
Levin, David; Aladl, Usaf; Germano, Guido; Slomka, Piotr
2005-09-01
We exploit consumer graphics hardware to perform real-time processing and visualization of high-resolution, 4D cardiac data. We have implemented real-time, realistic volume rendering, interactive 4D motion segmentation of cardiac data, visualization of multi-modality cardiac data and 3D display of multiple series cardiac MRI. We show that an ATI Radeon 9700 Pro can render a 512x512x128 cardiac Computed Tomography (CT) study at 0.9 to 60 frames per second (fps) depending on rendering parameters and that 4D motion based segmentation can be performed in real-time. We conclude that real-time rendering and processing of cardiac data can be implemented on consumer graphics cards.
Proposed hardware architectures of particle filter for object tracking
NASA Astrophysics Data System (ADS)
Abd El-Halym, Howida A.; Mahmoud, Imbaby Ismail; Habib, SED
2012-12-01
In this article, efficient hardware architectures for the particle filter (PF) are presented. We propose three different architectures for Sequential Importance Resampling Filter (SIRF) implementation. The first architecture is a two-step sequential PF machine, where particle sampling, weight, and output calculations are carried out in parallel during the first step, followed by sequential resampling in the second step. For the weight computation step, a piecewise linear function is used instead of the classical exponential function. This decreases the complexity of the architecture without degrading the results. The second architecture speeds up the resampling step via a parallel, rather than a serial, architecture. This second architecture targets a balance between hardware resources and the speed of operation. The third architecture implements the SIRF as a distributed PF composed of several processing elements and a central unit. All the proposed architectures are captured in VHDL, synthesized using the Xilinx environment, and verified using the ModelSim simulator. Synthesis results confirmed the resource reduction and speed-up advantages of our architectures.
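A software sketch of the SIRF loop with the piecewise-linear weight idea is given below; the three-segment approximation of exp(-u) and the scalar random-walk model are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(1)

def pwl_exp(u):
    """Piecewise-linear stand-in for exp(-u), u >= 0 (three segments;
    breakpoints here are illustrative, not the paper's)."""
    return np.select([u < 1.0, u < 3.0],
                     [1.0 - 0.632 * u, 0.368 - 0.159 * (u - 1.0)], 0.05)

def sirf(y, n_particles=500, q=0.1, r=0.5):
    """Sequential Importance Resampling filter for a scalar random walk."""
    x = rng.standard_normal(n_particles)
    est = []
    for yk in y:
        x = x + q * rng.standard_normal(n_particles)      # sample (predict)
        w = pwl_exp(0.5 * ((yk - x) / r) ** 2)            # weight without exp()
        w /= w.sum()
        est.append(w @ x)                                 # output calculation
        x = x[rng.choice(n_particles, n_particles, p=w)]  # resample
    return np.array(est)

true = np.cumsum(0.1 * rng.standard_normal(200))
obs = true + 0.5 * rng.standard_normal(200)
print(f"tracking MSE: {np.mean((sirf(obs) - true) ** 2):.4f}")
```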
NASA Astrophysics Data System (ADS)
Berdychowski, Piotr P.; Zabolotny, Wojciech M.
2010-09-01
The main goal of the C to VHDL compiler project is to make the FPGA platform more accessible for scientists and software developers. The FPGA platform offers the unique ability to configure the hardware to implement virtually any dedicated architecture, and modern devices provide a sufficient number of hardware resources to implement parallel execution platforms with complex processing units. All this makes the FPGA platform very attractive for those looking for an efficient, heterogeneous computing environment. The current industry standard for developing digital systems on the FPGA platform is based on HDLs. Although very effective and expressive in the hands of hardware development specialists, these languages require specific knowledge and experience, unreachable for most scientists and software programmers. The C to VHDL compiler project attempts to remedy that by creating an application that derives an initial VHDL description of a digital system (for further compilation and synthesis) from a purely algorithmic description in the C programming language. This idea itself is not new, and the C to VHDL compiler combines the best approaches from existing solutions developed over many previous years, with the introduction of some new unique improvements.
NASA Astrophysics Data System (ADS)
Deliparaschos, Kyriakos M.; Michail, Konstantinos; Zolotas, Argyrios C.; Tzafestas, Spyros G.
2016-05-01
This work presents a field programmable gate array (FPGA)-based embedded software platform coupled with a software-based plant, forming a hardware-in-the-loop (HIL) system that is used to validate a systematic sensor selection framework. The systematic sensor selection framework combines multi-objective optimization, linear-quadratic-Gaussian (LQG)-type control, and the nonlinear model of a maglev suspension. A robustness analysis of the closed loop is then performed (prior to implementation), supporting the appropriateness of the solution under parametric variation. The analysis also shows that quantization is robust under different controller gains. While the LQG controller is implemented on an FPGA, the physical process is realized in a high-level system modeling environment. FPGA technology enables rapid evaluation of the algorithms and test designs under realistic scenarios, avoiding the heavy time penalty associated with hardware description language (HDL) simulators. The HIL technique facilitates a significant speed-up in the required execution time when compared to its software-based counterpart model.
Learning and optimization with cascaded VLSI neural network building-block chips
NASA Technical Reports Server (NTRS)
Duong, T.; Eberhardt, S. P.; Tran, M.; Daud, T.; Thakoor, A. P.
1992-01-01
To demonstrate the versatility of the building-block approach, two neural network applications were implemented on cascaded analog VLSI chips. Weights were implemented using 7-b multiplying digital-to-analog converter (MDAC) synapse circuits, with 31 x 32 and 32 x 32 synapses per chip. A novel learning algorithm compatible with analog VLSI was applied to the two-input parity problem. The algorithm combines dynamically evolving architecture with limited gradient-descent backpropagation for efficient and versatile supervised learning. To implement the learning algorithm in hardware, synapse circuits were paralleled for additional quantization levels. The hardware-in-the-loop learning system allocated 2-5 hidden neurons for parity problems. Also, a 7 x 7 assignment problem was mapped onto a cascaded 64-neuron fully connected feedback network. In 100 randomly selected problems, the network found optimal or good solutions in most cases, with settling times in the range of 7-100 microseconds.
Quantization selection in the high-throughput H.264/AVC encoder based on the RD
NASA Astrophysics Data System (ADS)
Pastuszak, Grzegorz
2013-10-01
In a hardware video encoder, the quantization is responsible for quality losses. On the other hand, it allows the reduction of bit rates to the target one. If the mode selection is based on the rate-distortion criterion, the quantization can also be adjusted to obtain better compression efficiency. In particular, the use of a Lagrangian cost function with a given multiplier enables the encoder to select the most suitable quantization step, determined by the quantization parameter QP. Moreover, the quantization offset added before the fractional part is discarded can be adjusted. In order to select the best quantization parameter and offset in real time, the HD/SD encoder should be implemented in hardware. In particular, the hardware architecture should embed transformation and quantization modules able to process the same residuals many times. In this work, such an architecture is used. Experimental results show what improvements in terms of compression efficiency are achievable for intra coding.
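The selection principle can be sketched as a search over candidate quantization steps and offsets minimizing the Lagrangian cost J = D + λR. In the sketch below, the dead-zone quantizer follows common H.264 conventions (Qstep doubling every 6 QP; offsets of 1/3 and 1/6), while the rate term is a crude stand-in for real entropy coding.

```python
import numpy as np

def rd_cost(coeffs, qstep, offset, lam):
    """Lagrangian cost J = D + lambda * R for one candidate (Qstep, offset)."""
    levels = np.sign(coeffs) * np.floor(np.abs(coeffs) / qstep + offset)
    D = np.sum((coeffs - levels * qstep) ** 2)            # SSD distortion
    R = np.sum(np.abs(levels) > 0) + np.sum(np.log2(1.0 + np.abs(levels)))
    return D + lam * R                                    # R is a crude proxy

rng = np.random.default_rng(2)
coeffs = rng.laplace(scale=4.0, size=64)   # stand-in residual transform block
lam = 10.0
candidates = [(2 ** (qp / 6.0), off) for qp in range(20, 32)
              for off in (1 / 6, 1 / 3, 1 / 2)]
qstep, off = min(candidates, key=lambda c: rd_cost(coeffs, c[0], c[1], lam))
print(f"chosen Qstep = {qstep:.2f}, offset = {off:.3f}")
```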
ANNarchy: a code generation approach to neural simulations on parallel hardware
Vitay, Julien; Dinkelbach, Helge Ü.; Hamker, Fred H.
2015-01-01
Many modern neural simulators focus on the simulation of networks of spiking neurons on parallel hardware. Another important framework in computational neuroscience, rate-coded neural networks, is mostly difficult or impossible to implement using these simulators. We present here the ANNarchy (Artificial Neural Networks architect) neural simulator, which allows one to easily define and simulate rate-coded and spiking networks, as well as combinations of both. The interface in Python has been designed to be close to the PyNN interface, while the definition of neuron and synapse models can be specified using an equation-oriented mathematical description similar to the Brian neural simulator. This information is used to generate C++ code that will efficiently perform the simulation on the chosen parallel hardware (multi-core system or graphical processing unit). Several numerical methods are available to transform ordinary differential equations into efficient C++ code. We compare the parallel performance of the simulator to existing solutions. PMID:26283957
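The kind of rate-coded update that ANNarchy turns into generated C++ can be sketched directly; the network below (sizes, transfer function, and time constant all illustrative) integrates dr/dt = (-r + f(Wr + I))/τ with explicit Euler.

```python
import numpy as np

rng = np.random.default_rng(3)
n, dt, tau = 100, 1.0, 10.0                     # toy sizes and constants
W = rng.standard_normal((n, n)) / np.sqrt(n)    # random recurrent weights
I = rng.uniform(0.0, 1.0, n)                    # external input

def f(x):
    return np.maximum(x, 0.0)                   # positive transfer function

r = np.zeros(n)
for _ in range(1000):
    r += dt / tau * (-r + f(W @ r + I))         # explicit Euler update
print(f"mean firing rate: {r.mean():.3f}")
```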
NASA Astrophysics Data System (ADS)
Leamy, Michael J.; Springer, Adam C.
In this research we report a parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straightforward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to the cores present on a dual quad-core shared-memory system (eight total processors). We note that this message-passing parallelization strategy is directly applicable to cluster computing, which will be the focus of follow-on research. Results on the shared-memory platform indicate nearly ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel and yields excellent speed-up on SMP hardware.
NASA Technical Reports Server (NTRS)
Keymeulen, Didier; Ferguson, Michael I.; Fink, Wolfgang; Oks, Boris; Peay, Chris; Terrile, Richard; Cheng, Yen; Kim, Dennis; MacDonald, Eric; Foor, David
2005-01-01
We propose a tuning method for MEMS gyroscopes based on evolutionary computation to efficiently increase the sensitivity of MEMS gyroscopes through tuning. The tuning method was tested for the second generation JPL/Boeing Post-resonator MEMS gyroscope using the measurement of the frequency response of the MEMS device in open-loop operation. We also report on the development of a hardware platform for integrated tuning and closed loop operation of MEMS gyroscopes. The control of this device is implemented through a digital design on a Field Programmable Gate Array (FPGA). The hardware platform easily transitions to an embedded solution that allows for the miniaturization of the system to a single chip.
New Directions for Hardware-assisted Trusted Computing Policies (Position Paper)
NASA Astrophysics Data System (ADS)
Bratus, Sergey; Locasto, Michael E.; Ramaswamy, Ashwin; Smith, Sean W.
The basic technological building blocks of the TCG architecture seem to be stabilizing. As a result, we believe that the focus of the Trusted Computing (TC) discipline must naturally shift from the design and implementation of the hardware root of trust (and the subsequent trust chain) to the higher-level application policies. Such policies must build on these primitives to express new sets of security goals. We highlight the relationship between enforcing these types of policies and debugging, since both activities establish the link between expected and actual application behavior. We argue that this new class of policies better fits developers' mental models of expected application behaviors, and we suggest a hardware design direction for enabling the efficient interpretation of such policies.
A forecast-based STDP rule suitable for neuromorphic implementation.
Davies, S; Galluppi, F; Rast, A D; Furber, S B
2012-08-01
Artificial neural networks increasingly involve spiking dynamics to permit greater computational efficiency. This becomes especially attractive for on-chip implementation using dedicated neuromorphic hardware. However, both spiking neural networks and neuromorphic hardware have historically found difficulties in implementing efficient, effective learning rules. The best-known spiking neural network learning paradigm is Spike Timing Dependent Plasticity (STDP), which adjusts the strength of a connection in response to the time difference between the pre- and post-synaptic spikes. Approaches that relate learning features to the membrane potential of the post-synaptic neuron have emerged as possible alternatives to the more common STDP rule, with various implementations and approximations. Here we use a new type of neuromorphic hardware, SpiNNaker, which represents the flexible "neuromimetic" architecture, to demonstrate a new approach to this problem. Based on the standard STDP algorithm with modifications and approximations, a new rule, called STDP TTS (Time-To-Spike), relates the membrane potential with the Long Term Potentiation (LTP) part of the basic STDP rule. Meanwhile, we use the standard STDP rule for the Long Term Depression (LTD) part of the algorithm. We show that on the basis of the membrane potential it is possible to make a statistical prediction of the time needed by the neuron to reach the threshold, and therefore the LTP part of the STDP algorithm can be triggered when the neuron receives a spike. In our system these approximations allow efficient memory access, reducing the overall computational time and the memory bandwidth required. The improvements presented here are significant for real-time applications such as the ones for which the SpiNNaker system has been designed. We present simulation results that show the efficacy of this algorithm using one or more input patterns repeated over the whole time of the simulation. On-chip results show that the STDP TTS algorithm allows the neural network to adapt and detect the incoming pattern with improvements both in the reliability of, and the time required for, consistent output. Through the approximations we suggest in this paper, we introduce a learning rule that is easy to implement both in event-driven simulators and in dedicated hardware, reducing computational complexity relative to the standard STDP rule. Such a rule offers a promising solution, complementary to standard STDP evaluation algorithms, for real-time learning using spiking neural networks in time-critical applications. Copyright © 2012 Elsevier Ltd. All rights reserved.
Fast and Adaptive Lossless On-Board Hyperspectral Data Compression System for Space Applications
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh; Bakhshi, Alireza; Keymeulen, Didier; Klimesh, Matthew
2009-01-01
Efficient on-board lossless hyperspectral data compression reduces the data volume necessary to meet NASA and DoD limited downlink capabilities. The technique also improves signature extraction, object recognition, and feature classification capabilities by providing exact reconstructed data over constrained downlink resources. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data was recently developed. This technique uses an adaptive filtering method and achieves a combination of low complexity and compression effectiveness that far exceeds state-of-the-art techniques currently in use. The JPL-developed 'Fast Lossless' algorithm requires no training data or other specific information about the nature of the spectral bands for a fixed instrument dynamic range. It is of low computational complexity and thus well-suited for implementation in hardware, which makes it practical for flight implementations of pushbroom instruments. A prototype of the compressor (and decompressor) of the algorithm is available in software, but this implementation may not meet speed and real-time requirements of some space applications. Hardware acceleration provides performance improvements of 10x-100x vs. the software implementation (about 1M samples/sec on a Pentium IV machine). This paper describes a hardware implementation of the JPL-developed 'Fast Lossless' compression algorithm on a Field Programmable Gate Array (FPGA). The FPGA implementation targets the current state-of-the-art FPGAs (Xilinx Virtex IV and V families) and compresses one sample every clock cycle to provide a fast and practical real-time solution for space applications.
NASA Astrophysics Data System (ADS)
Rais, Muhammad H.
2010-06-01
This paper presents a Field Programmable Gate Array (FPGA) implementation of standard and truncated multipliers using the Very High Speed Integrated Circuit Hardware Description Language (VHDL). The truncated multiplier is a good candidate for digital signal processing (DSP) applications such as finite impulse response (FIR) filters and the discrete cosine transform (DCT). A remarkable reduction in FPGA resources, delay, and power can be achieved using truncated multipliers instead of standard parallel multipliers when the full precision of the standard multiplier is not required. The truncated multipliers show significant improvement compared to standard multipliers. Results show that the anomalies in average connection delay and maximum pin delay observed on the Spartan-3AN device are efficiently reduced on the Virtex-4 device.
Noise reduction and image enhancement using a hardware implementation of artificial neural networks
NASA Astrophysics Data System (ADS)
David, Robert; Williams, Erin; de Tremiolles, Ghislain; Tannhof, Pascal
1999-03-01
In this paper, we present a neural based solution developed for noise reduction and image enhancement using the ZISC, an IBM hardware processor which implements the Restricted Coulomb Energy algorithm and the K-Nearest Neighbor algorithm. Artificial neural networks present the advantages of processing time reduction in comparison with classical models, adaptability, and the weighted property of pattern learning. The goal of the developed application is image enhancement in order to restore old movies (noise reduction, focus correction, etc.), to improve digital television images, or to treat images which require adaptive processing (medical images, spatial images, special effects, etc.). Image results show a quantitative improvement over the noisy image as well as the efficiency of this system. Further enhancements are being examined to improve the output of the system.
Implementation of real-time digital signal processing systems
NASA Technical Reports Server (NTRS)
Narasimha, M.; Peterson, A.; Narayan, S.
1978-01-01
Special-purpose hardware implementation of DFT computers and digital filters is considered in the light of newly introduced algorithms and IC devices. Recent work by Winograd on high-speed convolution techniques for computing short-length DFTs has motivated the development of algorithms that are more efficient than the FFT for evaluating the transform of longer sequences. Among these, prime factor algorithms appear suitable for special-purpose hardware implementations. Architectural considerations in designing DFT computers based on these algorithms are discussed. With the availability of monolithic multiplier-accumulators, a direct implementation of IIR and FIR filters, using random access memories in place of shift registers, appears attractive. The memory addressing scheme involved in such implementations is discussed. A simple counter setup to address the data memory in the realization of FIR filters is also described. The combination of a set of simple filters (weighting network) and a DFT computer is shown to realize a bank of uniform bandpass filters. The usefulness of this concept in arriving at a modular design for a million-channel spectrum analyzer, based on microprocessors, is discussed.
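The prime factor (Good-Thomas) idea can be demonstrated compactly: for N = N1·N2 with coprime factors, index remapping converts one long DFT into a grid of short DFTs with no twiddle-factor multiplications between the stages. The sketch below verifies the mapping against a direct FFT.

```python
import numpy as np

def pfa_dft(x, N1, N2):
    """Good-Thomas prime-factor DFT, N = N1*N2 with gcd(N1, N2) = 1:
    index remapping turns one length-N DFT into an N1-by-N2 grid of
    short DFTs, with no twiddle factors between the stages."""
    N = N1 * N2
    t1 = pow(N1, -1, N2)                     # N1^{-1} mod N2
    t2 = pow(N2, -1, N1)                     # N2^{-1} mod N1
    n1, n2 = np.meshgrid(np.arange(N1), np.arange(N2), indexing="ij")
    grid = x[(N2 * n1 + N1 * n2) % N]        # Good's input index mapping
    Y = np.fft.fft(np.fft.fft(grid, axis=0), axis=1)  # short DFTs per axis
    out = np.empty(N, dtype=complex)
    out[(N2 * t2 * n1 + N1 * t1 * n2) % N] = Y        # CRT output mapping
    return out

x = np.random.default_rng(5).standard_normal(15)      # N = 3 * 5
assert np.allclose(pfa_dft(x, 3, 5), np.fft.fft(x))
print("prime-factor DFT matches the direct FFT")
```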
Techniques of EMG signal analysis: detection, processing, classification and applications
Hussain, M.S.; Mohd-Yasin, F.
2006-01-01
Electromyography (EMG) signals can be used for clinical/biomedical applications, Evolvable Hardware Chip (EHW) development, and modern human-computer interaction. EMG signals acquired from muscles require advanced methods for detection, decomposition, processing, and classification. The purpose of this paper is to illustrate the various methodologies and algorithms for EMG signal analysis and to provide efficient and effective ways of understanding the signal and its nature. We further point out some of the hardware implementations using EMG, focusing on applications related to prosthetic hand control, grasp recognition, and human-computer interaction. A comparison study is also given to show the performance of various EMG signal analysis methods. This paper gives researchers a good understanding of EMG signals and their analysis procedures. This knowledge will help them develop more powerful, flexible, and efficient applications. PMID:16799694
A FPGA implementation for linearly unmixing a hyperspectral image using OpenCL
NASA Astrophysics Data System (ADS)
Guerra, Raúl; López, Sebastián.; Sarmiento, Roberto
2017-10-01
Hyperspectral imaging systems provide images in which single pixels carry information from across the electromagnetic spectrum of the scene under analysis. These systems divide the spectrum into many contiguous channels, which may even lie outside the visible part of the spectrum. The main advantage of hyperspectral imaging technology is that certain objects leave unique fingerprints in the electromagnetic spectrum, known as spectral signatures, which make it possible to distinguish between different materials that may look the same in a traditional RGB image. Accordingly, the most important hyperspectral imaging applications are related to distinguishing or identifying materials in a particular scene. In hyperspectral imaging applications under real-time constraints, the huge amount of information provided by the hyperspectral sensors has to be rapidly processed and analysed. For such purposes, parallel hardware devices, such as Field Programmable Gate Arrays (FPGAs), are typically used. However, developing hardware applications typically requires expertise in the specific targeted device, as well as in the tools and methodologies which can be used to perform the implementation of the desired algorithms on the specific device. In this scenario, the Open Computing Language (OpenCL) emerges as a very interesting solution in which a single high-level synthesis design language can be used to efficiently develop applications on multiple and different hardware devices. In this work, the Fast Algorithm for Linearly Unmixing Hyperspectral Images (FUN) has been implemented on a Bitware Stratix V Altera FPGA using OpenCL. The obtained results demonstrate the suitability of OpenCL as a viable design methodology for quickly creating efficient FPGA designs for real-time hyperspectral imaging applications.
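Stripped of the FPGA and OpenCL machinery, linear unmixing itself reduces to solving a least-squares problem per pixel. The sketch below (synthetic endmembers and plain unconstrained least squares, simpler than the FUN algorithm) illustrates the computation being accelerated.

```python
import numpy as np

rng = np.random.default_rng(8)
bands, p, npix = 50, 3, 1000
E = rng.uniform(0.1, 1.0, (bands, p))          # assumed endmember signatures
A = rng.dirichlet(np.ones(p), npix)            # true abundances (sum to one)
Y = A @ E.T + 0.005 * rng.standard_normal((npix, bands))   # mixed pixels

# Unconstrained least squares per pixel (FUN adds constraints and speed)
A_hat, *_ = np.linalg.lstsq(E, Y.T, rcond=None)
print(f"mean abundance error: {np.mean(np.abs(A_hat.T - A)):.4f}")
```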
Real-time computing platform for spiking neurons (RT-spike).
Ros, Eduardo; Ortigosa, Eva M; Agís, Rodrigo; Carrillo, Richard; Arnold, Michael
2006-07-01
A computing platform is described for simulating arbitrary networks of spiking neurons in real time. A hybrid computing scheme is adopted that uses both software and hardware components to manage the tradeoff between flexibility and computational power; the neuron model is implemented in hardware and the network model and the learning are implemented in software. The incremental transition of the software components into hardware is supported. We focus on a spike response model (SRM) for a neuron where the synapses are modeled as input-driven conductances. The temporal dynamics of the synaptic integration process are modeled with a synaptic time constant that results in a gradual injection of charge. This type of model is computationally expensive and is not easily amenable to existing software-based event-driven approaches. As an alternative we have designed an efficient time-based computing architecture in hardware, where the different stages of the neuron model are processed in parallel. Further improvements occur by computing multiple neurons in parallel using multiple processing units. This design is tested using reconfigurable hardware and its scalability and performance evaluated. Our overall goal is to investigate biologically realistic models for the real-time control of robots operating within closed action-perception loops, and so we evaluate the performance of the system on simulating a model of the cerebellum where the emulation of the temporal dynamics of the synaptic integration process is important.
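The gradual charge injection that makes this model expensive for event-driven software can be seen in a few lines of Euler integration; all parameter values below are illustrative, not those of the hardware design.

```python
import numpy as np

dt, T = 0.1, 100.0                        # ms
tau_m, tau_syn = 10.0, 5.0                # membrane / synaptic time constants
v_rest, v_th, e_syn, w_syn = -70.0, -54.0, 0.0, 0.3
spike_times = [10.0, 12.0, 14.0, 50.0]    # presynaptic input spikes
g, v, out = 0.0, v_rest, []
for step in range(int(T / dt)):
    t = step * dt
    if any(abs(t - s) < dt / 2 for s in spike_times):
        g += w_syn                        # spike opens a synaptic conductance
    g -= dt * g / tau_syn                 # conductance decays gradually
    v += dt * ((v_rest - v) + g * (e_syn - v)) / tau_m
    if v >= v_th:                         # threshold crossing -> output spike
        out.append(round(t, 1))
        v = v_rest
print("output spikes (ms):", out)
```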
Robust optimization with transiently chaotic dynamical systems
NASA Astrophysics Data System (ADS)
Sumi, R.; Molnár, B.; Ercsey-Ravasz, M.
2014-05-01
Efficiently solving hard optimization problems has been a strong motivation for progress in analog computing. In a recent study we presented a continuous-time dynamical system for solving the NP-complete Boolean satisfiability (SAT) problem, with a one-to-one correspondence between its stable attractors and the SAT solutions. While physical implementations could offer great efficiency, the transiently chaotic dynamics raises the question of operability in the presence of noise, unavoidable on analog devices. Here we show that the probability of finding solutions is robust to noise intensities well above those present on real hardware. We also developed a cellular neural network model realizable with analog circuits, which tolerates even larger noise intensities. These methods represent an opportunity for robust and efficient physical implementations.
Hardware-efficient fermionic simulation with a cavity-QED system
NASA Astrophysics Data System (ADS)
Zhu, Guanyu; Subaşı, Yiǧit; Whitfield, James D.; Hafezi, Mohammad
2018-03-01
In digital quantum simulation of fermionic models with qubits, non-local maps for encoding are often encountered. Such maps require linear or logarithmic overhead in circuit depth, which could render the simulation useless for a given decoherence time. Here we show how one can use a cavity-QED system to perform digital quantum simulation of fermionic models. In particular, we show that highly non-local Jordan-Wigner or Bravyi-Kitaev transformations can be efficiently implemented through a hardware approach. The key idea is using ancilla cavity modes, which are dispersively coupled to a qubit string, to collectively manipulate and measure qubit states. Our scheme reduces the circuit depth in each Trotter step of the Jordan-Wigner encoding by a factor of N², compared to the scheme for a device with only local connectivity, where N is the number of orbitals for a generic two-body Hamiltonian. Additional analysis for the Fermi-Hubbard model on an N × N square lattice results in a similar reduction. We also discuss a detailed implementation of our scheme with superconducting qubits and cavities.
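The non-locality the cavity scheme removes is visible directly in the Jordan-Wigner operators: mode j carries a string of j Pauli-Z factors. The sketch below builds these operators for a handful of modes and checks the canonical anticommutation relations numerically.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
sm = np.array([[0.0, 1.0], [0.0, 0.0]])   # sigma^- = |0><1|

def jw_annihilation(j, n):
    """a_j = Z x ... x Z (j factors) x sigma^- x I x ... x I.
    The growing Z string is the non-locality discussed above."""
    return reduce(np.kron, [Z] * j + [sm] + [I2] * (n - j - 1))

n = 4
a = [jw_annihilation(j, n) for j in range(n)]
for i in range(n):
    for j in range(n):
        anti = a[i] @ a[j].T + a[j].T @ a[i]          # {a_i, a_j^dagger}
        assert np.allclose(anti, np.eye(2 ** n) if i == j else 0.0)
print(f"canonical anticommutation relations verified for {n} modes")
```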
The Effect of Pulse Shaping QPSK on Bandwidth Efficiency
NASA Technical Reports Server (NTRS)
Purba, Josua Bisuk Mubyarto; Horan, Shelia
1997-01-01
This research investigates the effect of pulse shaping QPSK on bandwidth efficiency over a non-linear channel. The investigation includes software simulations and a hardware implementation. Three kinds of filters have been investigated as pulse shaping filters: a 5th-order Butterworth filter, a 3rd-order Bessel filter, and a square root raised cosine filter with roll-off factors (alpha) of 0.25, 0.5, and 1. Two different high power amplifiers, a Traveling Wave Tube Amplifier (TWTA) and a Solid State Power Amplifier (SSPA), have been investigated in the hardware implementation. The results show a significant improvement in bandwidth utilization (rho) for the filtered data compared to unfiltered data through the non-linear channel. This method promises strong performance gains in a bandlimited channel when compared to unfiltered systems. This work was conducted at NMSU in the Center for Space Telemetering and Telecommunications Systems in the Klipsch School of Electrical and Computer Engineering and is supported by a grant from the National Aeronautics and Space Administration (NASA), NAG5-1491.
A Programmable Five Qubit Quantum Computer Using Trapped Atomic Ions
NASA Astrophysics Data System (ADS)
Debnath, Shantanu
Quantum computers can solve certain problems more efficiently than conventional classical methods. In the endeavor to build a quantum computer, several competing platforms have emerged that can implement certain quantum algorithms using a few qubits. However, the demonstrations so far have usually been done by tailoring the hardware to meet the requirements of a particular algorithm implemented for a limited number of instances. Although such proof-of-principle implementations are important to verify the working of algorithms on a physical system, they further need to have the potential to serve as a general-purpose quantum computer, allowing the flexibility required for running multiple algorithms, and to be scaled up to host more qubits. Here we demonstrate a small programmable quantum computer based on five trapped atomic ions, each of which serves as a qubit. By optically resolving each ion we can individually address them in order to perform a complete set of single-qubit and fully connected two-qubit quantum gates, and also perform efficient individual qubit measurements. We implement a computation architecture that accepts an algorithm from a user interface in the form of a standard logic gate sequence and decomposes it into fundamental quantum operations that are native to the hardware, using a set of compilation instructions defined within the software. These operations are then effected through a pattern of laser pulses that perform coherent rotations on targeted qubits in the chain. The architecture implemented in the experiment therefore gives us unprecedented flexibility in the programming of any quantum algorithm while staying blind to the underlying hardware. As a demonstration we implement the Deutsch-Jozsa and Bernstein-Vazirani algorithms on the five-qubit processor and achieve average success rates of 95 and 90 percent, respectively. We also implement a five-qubit coherent quantum Fourier transform and examine its performance in the period finding and phase estimation protocols. We find fidelities of 84 and 62 percent, respectively. While maintaining the same computation architecture, the system can be scaled to more ions using resources that scale favorably (O(N²)) with the number of qubits N.
APRON: A Cellular Processor Array Simulation and Hardware Design Tool
NASA Astrophysics Data System (ADS)
Barr, David R. W.; Dudek, Piotr
2009-12-01
We present a software environment for the efficient simulation of cellular processor arrays (CPAs). This software (APRON) is used to explore algorithms that are designed for massively parallel fine-grained processor arrays, topographic multilayer neural networks, vision chips with SIMD processor arrays, and related architectures. The software uses a highly optimised core combined with a flexible compiler to provide the user with tools for the design of new processor array hardware architectures and the emulation of existing devices. We present performance benchmarks for the software processor array implemented on standard commodity microprocessors. APRON can be configured to use additional processing hardware if necessary and can be used as a complete graphical user interface and development environment for new or existing CPA systems, allowing more users to develop algorithms for CPA systems.
An iterative approach to region growing using associative memories
NASA Technical Reports Server (NTRS)
Snyder, W. E.; Cowart, A.
1983-01-01
Region growing is often given as a classical example of the recursive control structures used in image processing, which are awkward to implement in hardware when the intent is the segmentation of an image at raster-scan rates. It is addressed here in light of the postulate that any computation which can be performed recursively can be performed easily and efficiently by iteration coupled with association. Attention is given to an algorithm and hardware structure able to perform region labeling iteratively at scan rates. Every pixel is individually labeled with an identifier which signifies the region to which it belongs. Difficulties otherwise requiring recursion are handled by maintaining an equivalence table in hardware, transparent to the computer which reads the labeled pixels. A simulation of the associative memory has demonstrated its effectiveness.
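The equivalence-table mechanism corresponds to the familiar two-pass connected-component labeling with union-find; a software sketch (4-connectivity assumed) is given below.

```python
import numpy as np

def label_regions(img):
    """Two-pass 4-connected labeling. Pass 1 raster-scans, assigning
    provisional labels and recording merges in an equivalence table
    (union-find); pass 2 resolves every label to its class root."""
    h, w = img.shape
    labels = np.zeros((h, w), dtype=int)
    parent = [0]                             # equivalence table: parent[label]

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    for y in range(h):
        for x in range(w):
            if not img[y, x]:
                continue
            above = labels[y - 1, x] if y else 0
            left = labels[y, x - 1] if x else 0
            if above and left:
                ra, rl = find(above), find(left)
                labels[y, x] = min(ra, rl)
                parent[max(ra, rl)] = min(ra, rl)   # record equivalence
            elif above or left:
                labels[y, x] = above or left
            else:
                parent.append(len(parent))          # fresh provisional label
                labels[y, x] = len(parent) - 1
    for y in range(h):                              # pass 2: resolve labels
        for x in range(w):
            labels[y, x] = find(labels[y, x])
    return labels

img = np.array([[1, 1, 0, 1],
                [0, 1, 0, 1],
                [1, 1, 0, 0]])
print(label_regions(img))
```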
NASA Technical Reports Server (NTRS)
Jain, A.; Man, G. K.
1993-01-01
This paper describes the Dynamics Algorithms for Real-Time Simulation (DARTS) real-time hardware-in-the-loop dynamics simulator for the National Aeronautics and Space Administration's Cassini spacecraft. The spacecraft model consists of a central flexible body with a number of articulated rigid-body appendages. The demanding performance requirements from the spacecraft control system require the use of a high fidelity simulator for control system design and testing. The DARTS algorithm provides a new algorithmic and hardware approach to the solution of this hardware-in-the-loop simulation problem. It is based upon the efficient spatial algebra dynamics for flexible multibody systems. A parallel and vectorized version of this algorithm is implemented on a low-cost, multiprocessor computer to meet the simulation timing requirements.
Parametric dense stereovision implementation on a system-on chip (SoC).
Gardel, Alfredo; Montejo, Pablo; García, Jorge; Bravo, Ignacio; Lázaro, José L
2012-01-01
This paper proposes a novel hardware implementation of a dense recovery of stereovision 3D measurements. Traditionally 3D stereo systems have imposed the maximum number of stereo correspondences, introducing a large restriction on artificial vision algorithms. The proposed system-on-chip (SoC) provides great performance and efficiency, with a scalable architecture available for many different situations, addressing real-time processing of the stereo image flow. Using double buffering techniques properly combined with pipelined processing, the use of reconfigurable hardware achieves a parametrisable SoC which gives the designer the opportunity to decide its right dimension and features. The proposed architecture does not need any external memory because the processing is done as the image flow arrives. Our SoC provides 3D data directly without the storage of whole stereo images. Our goal is to obtain high processing speed while maintaining the accuracy of 3D data using minimum resources. Configurable parameters may be controlled by later/parallel stages of the vision algorithm executed on an embedded processor. Considering a hardware FPGA clock of 100 MHz, image flows of up to 50 frames per second (fps) of dense stereo maps with more than 30,000 depth points can be obtained for 2 Mpix images, with a minimum initial latency. The implementation of computer vision algorithms on reconfigurable hardware, particularly low-level processing, opens up the prospect of its use in autonomous systems, where it can act as a coprocessor to reconstruct 3D images with high-density information in real time.
Effect of data truncation in an implementation of pixel clustering on a custom computing machine
NASA Astrophysics Data System (ADS)
Leeser, Miriam E.; Theiler, James P.; Estlick, Michael; Kitaryeva, Natalya V.; Szymanski, John J.
2000-10-01
We investigate the effect of truncating the precision of hyperspectral image data for the purpose of more efficiently segmenting the image using a variant of k-means clustering. We describe the implementation of the algorithm on field-programmable gate array (FPGA) hardware. Truncating the data to only a few bits per pixel in each spectral channel permits a more compact hardware design, enabling greater parallelism and, ultimately, more rapid execution. It also enables the storage of larger images in the onboard memory. In exchange for faster clustering, one trades off the quality of the produced segmentation. We find, however, that the clustering algorithm can tolerate considerable data truncation with little degradation in cluster quality. This robustness to truncated data can be extended by computing the cluster centers to a few more bits of precision than the data. Since there are so many more pixels than centers, the more aggressive data truncation leads to significant gains in the number of pixels that can be stored in memory and processed in hardware concurrently.
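The trade-off can be reproduced in miniature: quantize the data to its top b bits and compare the resulting k-means segmentation with the full-precision one. Everything below (synthetic spectra, deterministic seeding, greedy label matching) is an illustrative stand-in for the authors' hyperspectral setup.

```python
import numpy as np

rng = np.random.default_rng(4)

def kmeans(X, k, iters=25):
    C = X[:: max(1, len(X) // k)][:k].astype(float)   # deterministic seeding
    for _ in range(iters):
        assign = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(0)
    return assign

# Synthetic "spectra": three classes, eight channels, 12-bit dynamic range
X = np.vstack([rng.normal(m, 60.0, (300, 8)) for m in (500, 1800, 3200)])
X = X.clip(0, 4095)
full = kmeans(X, 3)
for bits in (12, 8, 6, 4, 2):
    trunc = kmeans(np.floor(X / 2 ** (12 - bits)), 3)  # keep top `bits` bits
    # Agreement up to label permutation (greedy contingency match)
    agree = sum(max(np.sum((full == i) & (trunc == j)) for j in range(3))
                for i in range(3))
    print(f"{bits:2d} bits: {agree / len(X):.3f} agreement")
```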
Speed challenge: a case for hardware implementation in soft-computing
NASA Technical Reports Server (NTRS)
Daud, T.; Stoica, A.; Duong, T.; Keymeulen, D.; Zebulum, R.; Thomas, T.; Thakoor, A.
2000-01-01
For over a decade, JPL has been actively involved in soft computing research on theory, architecture, applications, and electronics hardware. The driving force in all our research activities, in addition to the promise of an enabling technology, has been the creation of a niche that imparts an orders-of-magnitude speed advantage through implementation in parallel processing hardware, with algorithms made especially suitable for hardware implementation. We review our work on neural networks, fuzzy logic, and evolvable hardware, with selected application examples requiring real-time response capabilities.
NASA Astrophysics Data System (ADS)
Poinsot, Audrey; Yang, Fan; Brost, Vincent
2011-02-01
Including multiple sources of information in personal identity recognition and verification gives the opportunity to greatly improve performance. We propose a contactless biometric system that combines two modalities: palmprint and face. Hardware implementations are proposed on the Texas Instruments Digital Signal Processor and Xilinx Field-Programmable Gate Array (FPGA) platforms. The algorithmic chain consists of preprocessing (which includes palm extraction from hand images), Gabor feature extraction, comparison by Hamming distance, and score fusion. Fusion possibilities are discussed and tested first using a bimodal database of 130 subjects that we designed (uB database), and then two common public biometric databases (AR for face and PolyU for palmprint). High performance has been obtained for both recognition and verification: a recognition rate of 97.49% on the AR-PolyU database and an equal error rate of 1.10% on the uB database, using only two training samples per subject. Hardware results demonstrate that preprocessing can easily be performed during the acquisition phase, and multimodal biometric recognition can be treated almost instantly (0.4 ms on FPGA). We show the feasibility of a robust and efficient multimodal hardware biometric system that offers several advantages, such as user-friendliness and flexibility.
Efficient implementation of neural network deinterlacing
NASA Astrophysics Data System (ADS)
Seo, Guiwon; Choi, Hyunsoo; Lee, Chulhee
2009-02-01
Interlaced scanning has been widely used in most broadcasting systems. However, it produces some undesirable artifacts such as jagged patterns, flickering, and line twitter. Moreover, most recent TV monitors utilize flat panel display technologies such as LCD or PDP and these monitors require progressive formats. Consequently, the conversion of interlaced video into progressive video is required in many applications, and a number of deinterlacing methods have been proposed. Recently, deinterlacing methods based on neural networks have been proposed with good results. On the other hand, with high-resolution video content such as HDTV, the amount of video data to be processed is very large. As a result, the processing time and hardware complexity become an important issue. In this paper, we propose an efficient implementation of neural network deinterlacing using polynomial approximation of the sigmoid function. Experimental results show that these approximations provide equivalent performance with a considerable reduction of complexity. This implementation of neural network deinterlacing can be efficiently incorporated in hardware implementations.
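The core idea, replacing the sigmoid with a low-order polynomial over a bounded input range, can be sketched as follows in Python; the degrees and interval here are assumptions for illustration, not the paper's chosen approximation:

```python
# Sketch: polynomial approximation of the sigmoid on a bounded input range.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-8, 8, 2001)          # pre-activations assumed bounded
for degree in (3, 5, 7):
    coeffs = np.polyfit(x, sigmoid(x), degree)   # least-squares fit
    err = np.abs(np.polyval(coeffs, x) - sigmoid(x)).max()
    print(f"degree {degree}: max abs error = {err:.4f}")

# Outside the fitted interval the polynomial diverges, so inputs are clamped.
def sigmoid_poly(x, coeffs, lo=-8.0, hi=8.0):
    return np.polyval(coeffs, np.clip(x, lo, hi))
```

A polynomial of modest degree needs only multipliers and adders, which is what makes the approximation attractive for hardware.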
High-Performance Design Patterns for Modern Fortran
Haveraaen, Magne; Morris, Karla; Rouson, Damian; ...
2015-01-01
This paper presents ideas for using coordinate-free numerics in modern Fortran to achieve code flexibility in the partial differential equation (PDE) domain. We also show how Fortran, over the last few decades, has changed to become a language well-suited for state-of-the-art software development. Fortran’s new coarray distributed data structure, the language’s class mechanism, and its side-effect-free, pure procedure capability provide the scaffolding on which we implement HPC software. These features empower compilers to organize parallel computations with efficient communication. We present some programming patterns that support asynchronous evaluation of expressions comprised of parallel operations on distributed data. We implemented these patterns using coarrays and the message passing interface (MPI). We compared the codes’ complexity and performance. The MPI code is much more complex and depends on external libraries. The MPI code on Cray hardware using the Cray compiler is 1.5–2 times faster than the coarray code on the same hardware. The Intel compiler implements coarrays atop Intel’s MPI library with the result apparently being 2–2.5 times slower than manually coded MPI despite exhibiting nearly linear scaling efficiency. As compilers mature and further improvements to coarrays come in Fortran 2015, we expect this performance gap to narrow.
Functional Basis for Efficient Physical Layer Classical Control in Quantum Processors
NASA Astrophysics Data System (ADS)
Ball, Harrison; Nguyen, Trung; Leong, Philip H. W.; Biercuk, Michael J.
2016-12-01
The rapid progress seen in the development of quantum-coherent devices for information processing has motivated serious consideration of quantum computer architecture and organization. One topic which remains open for investigation and optimization relates to the design of the classical-quantum interface, where control operations on individual qubits are applied according to higher-level algorithms; accommodating competing demands on performance and scalability remains a major outstanding challenge. In this work, we present a resource-efficient, scalable framework for the implementation of embedded physical layer classical controllers for quantum-information systems. Design drivers and key functionalities are introduced, leading to the selection of Walsh functions as an effective functional basis for both programming and controller hardware implementation. This approach leverages the simplicity of real-time Walsh-function generation in classical digital hardware, and the fact that a wide variety of physical layer controls, such as dynamic error suppression, are known to fall within the Walsh family. We experimentally implement a real-time field-programmable-gate-array-based Walsh controller producing Walsh timing signals and Walsh-synthesized analog waveforms appropriate for critical tasks in error-resistant quantum control and noise characterization. These demonstrations represent the first step towards a unified framework for the realization of physical layer controls compatible with large-scale quantum-information processing.
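For readers unfamiliar with the basis, the following Python sketch generates Walsh functions from a Sylvester-type Hadamard matrix and orders them by sequency (number of sign changes); in digital hardware each row reduces to a simple ±1 switching pattern. The matrix size is an illustrative choice:

```python
# Sketch: Walsh functions from a Sylvester Hadamard construction.
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

H = hadamard(16)
sequency = (np.diff(H, axis=1) != 0).sum(axis=1)   # sign changes per row
walsh = H[np.argsort(sequency)]                    # Walsh (sequency) order
for k in (0, 1, 2, 3):
    print(f"w_{k}:", walsh[k])
```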
An Efficient Randomized Algorithm for Real-Time Process Scheduling in PicOS Operating System
NASA Astrophysics Data System (ADS)
Helmy, Tarek; Anifowose, Fatai; Sallam, El-Sayed
PicOS is an event-driven operating environment designed for use with embedded networked sensors. More specifically, it is designed to support the concurrency-intensive operations required by networked sensors with minimal hardware requirements. The existing process scheduling algorithms of PicOS, a commercial tiny, low-footprint, real-time operating system, have their associated drawbacks. An efficient alternative algorithm, based on a randomized selection policy, has been proposed, demonstrated, confirmed to be efficient and fair on average, and recommended for implementation in PicOS. Simulations were carried out, and performance measures such as Average Waiting Time (AWT) and Average Turn-around Time (ATT) were used to assess the efficiency of the proposed randomized version over the existing ones. The results show that the randomized algorithm is the most attractive for implementation in PicOS, since it is the fairest and has the lowest AWT and ATT, on average, among the non-preemptive scheduling algorithms implemented in this paper.
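A minimal Python simulation of this kind of comparison might look as follows; the synthetic workload and the simple non-preemptive policies are illustrative assumptions, not the paper's exact experimental setup:

```python
# Sketch: comparing a randomized non-preemptive scheduler with FCFS on
# Average Waiting Time (AWT) and Average Turn-around Time (ATT).
import random

def schedule_metrics(bursts):
    t, wait, turn = 0, [], []
    for b in bursts:            # serve jobs in the given order
        wait.append(t)
        t += b
        turn.append(t)
    return sum(wait) / len(wait), sum(turn) / len(turn)

random.seed(1)
bursts = [random.randint(1, 20) for _ in range(50)]  # synthetic CPU bursts

awt_fcfs, att_fcfs = schedule_metrics(bursts)

# Randomized policy: average the metrics over many random service orders.
n_trials = 1000
awt_r = att_r = 0.0
for _ in range(n_trials):
    order = bursts[:]
    random.shuffle(order)
    a, t = schedule_metrics(order)
    awt_r += a / n_trials
    att_r += t / n_trials

print(f"FCFS:       AWT={awt_fcfs:6.2f}  ATT={att_fcfs:6.2f}")
print(f"Randomized: AWT={awt_r:6.2f}  ATT={att_r:6.2f}")
```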
Hardware implementation of hierarchical volume subdivision-based elastic registration.
Dandekar, Omkar; Walimbe, Vivek; Shekhar, Raj
2006-01-01
Real-time, elastic and fully automated 3D image registration is critical to the efficiency and effectiveness of many image-guided diagnostic and treatment procedures relying on multimodality image fusion or serial image comparison. True real-time performance will make many 3D image registration-based techniques clinically viable. Hierarchical volume subdivision-based image registration techniques are inherently faster than most elastic registration techniques, e.g. free-form deformation (FFD)-based techniques, and are more amenable to achieving real-time performance through hardware acceleration. Our group has previously reported an FPGA-based architecture for accelerating FFD-based image registration. In this article we show how our existing architecture can be adapted to support hierarchical volume subdivision-based image registration. A proof-of-concept implementation of the architecture achieved speedups of 100× for elastic registration against an optimized software implementation on a 3.2 GHz Pentium III Xeon workstation. Due to the inherently parallel nature of hierarchical volume subdivision-based image registration techniques, further speedup can be achieved by using several computing modules in parallel.
NASA Astrophysics Data System (ADS)
Gilbert, B. K.; Robb, R. A.; Chu, A.; Kenue, S. K.; Lent, A. H.; Swartzlander, E. E., Jr.
1981-02-01
Rapid advances during the past ten years of several forms of computer-assisted tomography (CT) have resulted in the development of numerous algorithms to convert raw projection data into cross-sectional images. These reconstruction algorithms are either 'iterative', in which a large matrix-algebraic equation is solved by successive approximation techniques, or 'closed form'. Continuing evolution of the closed-form algorithms has allowed the newest versions to produce excellent reconstructed images in most applications. This paper will review several computer software and special-purpose digital hardware implementations of closed-form algorithms, either proposed during the past several years by a number of workers or actually implemented in commercial or research CT scanners. The discussion will also cover a number of recently investigated algorithmic modifications which reduce the amount of computation required to execute the reconstruction process, as well as several new special-purpose digital hardware implementations under development in laboratories at the Mayo Clinic.
FPGA Implementation of Optimal 3D-Integer DCT Structure for Video Compression
2015-01-01
A novel optimal structure for implementing the 3D-integer discrete cosine transform (DCT) is presented by analyzing various integer approximation methods. The integer sets with reduced mean squared error (MSE) and high coding efficiency are considered for implementation in FPGA. The proposed method proves that the fewest resources are utilized for the integer set that has shorter bit values. The optimal 3D-integer DCT structure is determined by analyzing the MSE, power dissipation, coding efficiency, and hardware complexity of different integer sets. The experimental results reveal that the direct method of computing the 3D-integer DCT using the integer set [10, 9, 6, 2, 3, 1, 1] performs better than other integer sets in terms of resource utilization and power dissipation. PMID:26601120
Event management for large scale event-driven digital hardware spiking neural networks.
Caron, Louis-Charles; D'Haene, Michiel; Mailhot, Frédéric; Schrauwen, Benjamin; Rouat, Jean
2013-09-01
The interest in brain-like computation has led to the design of a plethora of innovative neuromorphic systems. Individually, spiking neural networks (SNNs), event-driven simulation and digital hardware neuromorphic systems get a lot of attention. Despite the popularity of event-driven SNNs in software, very few digital hardware architectures are found. This is because existing hardware solutions for event management scale badly with the number of events. This paper introduces the structured heap queue, a pipelined digital hardware data structure, and demonstrates its suitability for event management. The structured heap queue scales gracefully with the number of events, allowing the efficient implementation of large scale digital hardware event-driven SNNs. The scaling is linear for memory, logarithmic for logic resources and constant for processing time. The use of the structured heap queue is demonstrated on a field-programmable gate array (FPGA) with an image segmentation experiment and an SNN of 65,536 neurons and 513,184 synapses. Events can be processed at a rate of one every 7 clock cycles, and a 406×158 pixel image is segmented in 200 ms.
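Functionally, such an event queue orders spike events by delivery time. The Python sketch below models only this ordering behavior with a software binary heap (heapq); the paper's contribution is a pipelined hardware realization with constant processing time, which a software heap does not capture:

```python
# Sketch: functional model of an event queue for an event-driven SNN.
import heapq

events = []                      # entries are (delivery_time, target_neuron)

def push_spike(t, neuron):       # O(log n), like one heap-queue operation
    heapq.heappush(events, (t, neuron))

def next_spike():
    return heapq.heappop(events) if events else None

push_spike(5.0, 42)
push_spike(1.5, 7)
push_spike(3.2, 13)
while (ev := next_spike()) is not None:
    t, neuron = ev
    print(f"t={t}: deliver spike to neuron {neuron}")  # 1.5, 3.2, 5.0 order
```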
Distributed and Modular CAN-Based Architecture for Hardware Control and Sensor Data Integration
Losada, Diego P.; Fernández, Joaquín L.; Paz, Enrique; Sanz, Rafael
2017-01-01
In this article, we present a CAN-based (Controller Area Network) distributed system to integrate sensors, actuators and hardware controllers in a mobile robot platform. With this work, we provide a robust, simple, flexible and open system to make hardware elements or subsystems communicate, that can be applied to different robots or mobile platforms. Hardware modules can be connected to or disconnected from the CAN bus while the system is working. It has been tested in our mobile robot Rato, based on a RWI (Real World Interface) mobile platform, to replace the old sensor and motor controllers. It has also been used in the design of two new robots: BellBot and WatchBot. Currently, our hardware integration architecture supports different sensors, actuators and control subsystems, such as motor controllers and inertial measurement units. The integration architecture was tested and compared with other solutions through a performance analysis of relevant parameters such as transmission efficiency and bandwidth usage. The results conclude that the proposed solution implements a lightweight communication protocol for mobile robot applications that avoids transmission delays and overhead. PMID:28467381
Double-tick realization of binary control program
NASA Astrophysics Data System (ADS)
Kobylecki, Michał; Kania, Dariusz
2016-12-01
This paper presents a procedure for the hardware implementation of bit-level control algorithms compatible with the IEC 61131-3 standard. The described transformation, based on set calculus and graphs, translates the original control program into a fully equivalent form whose architecture executes in two clock ticks. The proposed method enables efficient implementation of bit-level control in an FPGA using the standardized LD programming language.
Enabling High Efficiency Ethanol Engines
DOE Office of Scientific and Technical Information (OSTI.GOV)
Szybist, J.; Confer, K.
2011-03-01
Delphi Automotive Systems and ORNL established this CRADA to explore the potential to improve the energy efficiency of spark-ignited engines operating on ethanol-gasoline blends. By taking advantage of the fuel properties of ethanol, such as high compression ratio and high latent heat of vaporization, it is possible to increase efficiency with ethanol blends. Increasing the efficiency with ethanol-containing blends aims to remove a market barrier of reduced fuel economy with E85 fuel blends, which is currently about 30% lower than with petroleum-derived gasoline. The same or higher engine efficiency is achieved with E85, and the reduction in fuel economy is due to the lower energy density of E85. By making ethanol blends more efficient, the fuel economy gap between gasoline and E85 can be reduced. In the partnership between Delphi and ORNL, each organization brought a unique and complementary set of skills to the project. Delphi has extensive knowledge and experience in powertrain components and subsystems as well as overcoming real-world implementation barriers. ORNL has extensive knowledge and expertise in non-traditional fuels and improving engine system efficiency for the next generation of internal combustion engines. Partnering to combine these knowledge bases was essential towards making progress to reducing the fuel economy gap between gasoline and E85. ORNL and Delphi maintained strong collaboration throughout the project. Meetings were held regularly, usually on a bi-weekly basis, with additional reports, presentations, and meetings as necessary to maintain progress. Delphi provided substantial hardware support to the project by providing components for the single-cylinder engine experiments, engineering support for hardware modifications, guidance for operational strategies on engine research, and hardware support by providing a flexible multi-cylinder engine to be used for optimizing engine efficiency with ethanol-containing fuels.
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.
1992-01-01
An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.
International Conference on Stiff Computation Held at Park City, Utah on April 12, 13 and 14, 1982.
1983-05-31
algorithm should be designed which can analyse a system description and find out for the user to which class of problems his system belongs... Dove: ...processors designed to implement a specific solution process. Byrne: the IEEE floating-point chip design used by Intel and others is an example (Kahan)... the hardware specialist has designed his computer such that the parallel features can be addressed conveniently and efficiently, and the software
Hardware-efficient bosonic quantum error-correcting codes based on symmetry operators
NASA Astrophysics Data System (ADS)
Niu, Murphy Yuezhen; Chuang, Isaac L.; Shapiro, Jeffrey H.
2018-03-01
We establish a symmetry-operator framework for designing quantum error-correcting (QEC) codes based on fundamental properties of the underlying system dynamics. Based on this framework, we propose three hardware-efficient bosonic QEC codes that are suitable for χ(2)-interaction based quantum computation in multimode Fock bases: the χ(2) parity-check code, the χ(2) embedded error-correcting code, and the χ(2) binomial code. All of these QEC codes detect photon-loss or photon-gain errors by means of photon-number parity measurements, and then correct them via χ(2) Hamiltonian evolutions and linear-optics transformations. Our symmetry-operator framework provides a systematic procedure for finding QEC codes that are not stabilizer codes, and it enables convenient extension of a given encoding to higher-dimensional qudit bases. The χ(2) binomial code is of special interest because, with m ≤ N identified from channel monitoring, it can correct m-photon-loss errors, or m-photon-gain errors, or (m-1)th-order dephasing errors using logical qudits that are encoded in O(N) photons. In comparison, other bosonic QEC codes require O(N^2) photons to correct the same degree of bosonic errors. Such improved photon efficiency underscores the additional error-correction power that can be provided by channel monitoring. We develop quantum Hamming bounds for photon-loss errors in the code subspaces associated with the χ(2) parity-check code and the χ(2) embedded error-correcting code, and we prove that these codes saturate their respective bounds. Our χ(2) QEC codes exhibit hardware efficiency in that they address the principal error mechanisms and exploit the available physical interactions of the underlying hardware, thus reducing the physical resources required for implementing their encoding, decoding, and error-correction operations, and their universal encoded-basis gate sets.
A Stochastic Spiking Neural Network for Virtual Screening.
Morro, A; Canals, V; Oliver, A; Alomar, M L; Galan-Prado, F; Ballester, P J; Rossello, J L
2018-04-01
Virtual screening (VS) has become a key computational tool in early drug design, and screening performance is of high relevance due to the large volume of data that must be processed to identify molecules with the sought activity-related pattern. At the same time, hardware implementations of spiking neural networks (SNNs) arise as an emerging computing technique that can be applied to parallelize processes that normally present a high cost in terms of computing time and power. Consequently, SNNs represent an attractive alternative for performing time-consuming processing tasks such as VS. In this brief, we present a smart stochastic spiking neural architecture that implements the ultrafast shape recognition (USR) algorithm, achieving a two-orders-of-magnitude speed improvement with respect to USR software implementations. The neural system is implemented in hardware using field-programmable gate arrays, allowing a highly parallelized USR implementation. The results show that, due to the high parallelization of the system, millions of compounds can be checked in reasonable times. From these results, we can state that the proposed architecture arises as a feasible methodology to efficiently enhance time-consuming data-mining processes such as 3-D molecular similarity search.
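For context, a common formulation of USR computes a 12-dimensional descriptor (three distance moments for each of four reference points) and an inverse-Manhattan similarity score. The Python sketch below follows that formulation; the paper's stochastic spiking implementation differs in mechanism but targets the same computation:

```python
# Sketch: the USR shape descriptor and similarity in their common form.
import numpy as np

def usr_descriptor(coords):
    ctd = coords.mean(axis=0)                               # centroid
    d_ctd = np.linalg.norm(coords - ctd, axis=1)
    cst = coords[d_ctd.argmin()]                            # closest to ctd
    fct = coords[d_ctd.argmax()]                            # farthest from ctd
    ftf = coords[np.linalg.norm(coords - fct, axis=1).argmax()]
    desc = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(coords - ref, axis=1)
        mu, sigma = d.mean(), d.std()
        skew = ((d - mu) ** 3).mean() / (sigma ** 3 + 1e-12)
        desc += [mu, sigma, np.cbrt(skew)]
    return np.array(desc)

def usr_similarity(a, b):
    return 1.0 / (1.0 + np.abs(a - b).mean())               # score in (0, 1]

rng = np.random.default_rng(0)
mol1 = rng.normal(size=(30, 3))                             # toy conformer
mol2 = mol1 + rng.normal(scale=0.05, size=(30, 3))          # near-duplicate
print(usr_similarity(usr_descriptor(mol1), usr_descriptor(mol2)))
```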
Efficiently passing messages in distributed spiking neural network simulation.
Thibeault, Corey M; Minkovich, Kirill; O'Brien, Michael J; Harris, Frederick C; Srinivasa, Narayan
2013-01-01
Efficiently passing spiking messages in a neural model is an important aspect of high-performance simulation. As the scale of networks has increased, so has the size of the computing systems required to simulate them. In addition, the information exchange of these resources has become more of an impediment to performance. In this paper we explore spike message passing using different mechanisms provided by the Message Passing Interface (MPI). A specific implementation, MVAPICH, designed for high-performance clusters with InfiniBand hardware, is employed. The focus is on providing information about these mechanisms for users of commodity high-performance spiking simulators. In addition, a novel hybrid method for spike exchange was implemented and benchmarked.
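As a minimal illustration of collective spike exchange, the following mpi4py sketch gathers every rank's local spike list on all ranks; this shows only one of the several MPI mechanisms the paper benchmarks, and the spike format is an illustrative assumption (run under mpiexec, e.g. with 4 ranks):

```python
# Sketch: exchanging spike lists with an MPI collective via mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank produces the spikes fired locally this timestep: (neuron_id, t).
local_spikes = [(100 * rank + i, 0.1 * i) for i in range(3)]

# Collective exchange: every rank receives every other rank's spike list.
all_spikes = comm.allgather(local_spikes)

if rank == 0:
    total = sum(len(s) for s in all_spikes)
    print(f"{total} spikes gathered from {comm.Get_size()} ranks")
```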
Lunar Applications in Reconfigurable Computing
NASA Technical Reports Server (NTRS)
Somervill, Kevin
2008-01-01
NASA's Constellation Program is developing a lunar surface outpost in which reconfigurable computing will play a significant role. Reconfigurable systems provide a number of benefits over conventional software-based implementations, including performance and power efficiency, while the use of standardized reconfigurable hardware provides opportunities to reduce logistical overhead. The current vision for the lunar surface architecture includes habitation, mobility, and communications systems, each of which greatly benefits from reconfigurable hardware in applications including video processing, natural feature recognition, data formatting, IP offload processing, and embedded control systems. In deploying reprogrammable hardware, considerations similar to those of software systems must be managed. There needs to be a mechanism for discovery enabling applications to locate and utilize the available resources. Also, application interfaces are needed to provide for both configuring the resources and transferring data between the application and the reconfigurable hardware. Each of these topics is explored in the context of deploying reconfigurable resources as an integral aspect of the lunar exploration architecture.
Massengill, L W; Mundie, D B
1992-01-01
A neural network IC based on dynamic charge injection is described. The hardware design is space and power efficient, and achieves massive parallelism of analog inner products via charge-based multipliers and spatially distributed summing buses. Basic synaptic cells are constructed of exponential pulse-decay modulation (EPDM) dynamic injection multipliers operating sequentially on propagating signal vectors and locally stored analog weights. Individually adjustable gain controls on each neuron reduce the effects of limited weight dynamic range. A hardware simulator/trainer has been developed which incorporates the physical (nonideal) characteristics of actual circuit components into the training process, thus absorbing nonlinearities and parametric deviations into the macroscopic performance of the network. Results show that charge-based techniques may achieve a high degree of neural density and throughput using standard CMOS processes.
Further optimization of SeDDaRA blind image deconvolution algorithm and its DSP implementation
NASA Astrophysics Data System (ADS)
Wen, Bo; Zhang, Qiheng; Zhang, Jianlin
2011-11-01
An efficient algorithm for blind image deconvolution and its high-speed implementation are of great value in practice. A further optimization of SeDDaRA is developed, from the algorithm structure to the numerical calculation methods. The main optimizations cover modularization of the structure for implementation feasibility, reduction of the data computation and dependency of the 2D-FFT/IFFT, and acceleration of the power operation by a segmented look-up table. The Fast SeDDaRA is then proposed, specialized for low complexity. As the final implementation, a hardware image-restoration system is built using multi-DSP parallel processing. Experimental results show that the processing time and memory demand of Fast SeDDaRA decrease by at least 50%, and the data throughput of the image restoration system exceeds 7.8 Msps. The optimization is proved efficient and feasible, and Fast SeDDaRA is able to support real-time applications.
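The segmented look-up table idea for the power operation can be sketched as follows in Python; the exponent, segment count, and input range are illustrative assumptions:

```python
# Sketch: approximating x**P with a segmented look-up table plus linear
# interpolation, trading a small accuracy loss for much cheaper evaluation.
import numpy as np

P = 0.6                       # example exponent
SEGMENTS = 64                 # table size: trade accuracy for memory
XMAX = 1.0                    # inputs assumed normalized to [0, XMAX]

edges = np.linspace(0.0, XMAX, SEGMENTS + 1)
table = edges ** P            # precomputed once

def pow_lut(x):
    """Approximate x**P by interpolating between table entries."""
    x = np.clip(x, 0.0, XMAX)
    idx = np.minimum((x / XMAX * SEGMENTS).astype(int), SEGMENTS - 1)
    frac = x / XMAX * SEGMENTS - idx
    return table[idx] * (1 - frac) + table[idx + 1] * frac

x = np.random.default_rng(0).random(10000)
print("max abs error:", np.abs(pow_lut(x) - x ** P).max())
```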
Wang, Jinlong; Lu, Mai; Hu, Yanwen; Chen, Xiaoqiang; Pan, Qiangqiang
2015-12-01
The neuron is the basic unit of the biological nervous system. The Hodgkin-Huxley (HH) model is one of the most realistic neuron models in terms of describing the electrophysiological characteristics of neurons. Hardware implementation of neurons could provide new research ideas for the clinical treatment of spinal cord injury, bionics, and artificial intelligence. Based on the HH model neuron and DSP Builder technology, in the present study a single HH model neuron was implemented in a Field Programmable Gate Array (FPGA). The neuron implemented in the FPGA was stimulated by different types of current, the action potential response characteristics were analyzed, and the correlation coefficients between the numerical simulation results and the hardware implementation results were calculated. The results showed that the action potential responses of the FPGA neuron were highly consistent with the numerical simulation. This work lays the foundation for hardware implementation of neural networks.
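For reference, the state updates that such an FPGA design implements are those of the standard HH equations. The Python sketch below integrates them with forward Euler and the classic squid-axon parameters; the fixed-point mapping used in the hardware is omitted, and the stimulus is an illustrative step current:

```python
# Sketch: Hodgkin-Huxley simulation, forward Euler, classic parameters.
import numpy as np

C, gNa, gK, gL = 1.0, 120.0, 36.0, 0.3           # uF/cm^2, mS/cm^2
ENa, EK, EL = 50.0, -77.0, -54.387               # mV

a_n = lambda V: 0.01 * (V + 55) / (1 - np.exp(-(V + 55) / 10))
b_n = lambda V: 0.125 * np.exp(-(V + 65) / 80)
a_m = lambda V: 0.1 * (V + 40) / (1 - np.exp(-(V + 40) / 10))
b_m = lambda V: 4.0 * np.exp(-(V + 65) / 18)
a_h = lambda V: 0.07 * np.exp(-(V + 65) / 20)
b_h = lambda V: 1.0 / (1 + np.exp(-(V + 35) / 10))

dt, T = 0.01, 50.0                               # ms
V, n, m, h = -65.0, 0.317, 0.053, 0.596          # resting state
spikes = 0
for step in range(int(T / dt)):
    I_ext = 10.0                                 # uA/cm^2 step current
    I_ion = gNa * m**3 * h * (V - ENa) + gK * n**4 * (V - EK) + gL * (V - EL)
    V_new = V + dt * (I_ext - I_ion) / C
    n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
    m += dt * (a_m(V) * (1 - m) - b_m(V) * m)
    h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
    if V < 0 <= V_new:                           # upward zero crossing
        spikes += 1
    V = V_new
print(f"{spikes} action potentials in {T} ms")
```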
NASA Technical Reports Server (NTRS)
Burt, Adam O.; Tinker, Michael L.
2014-01-01
In this paper, genetic-algorithm-based and gradient-based topology optimization are presented in application to a real hardware design problem. Preliminary design of a planetary lander mockup structure is accomplished using these methods, which provide major weight savings by addressing structural efficiency during the design cycle. This paper presents two alternative formulations of the topology optimization problem. The first is the widely used gradient-based implementation using commercially available algorithms. The second is formulated using genetic algorithms and internally developed capabilities. These two approaches are applied to a practical design problem for hardware that has been built, tested and proven to be functional. Both formulations converged on similar solutions and therefore were proven to be equally valid implementations of the process. This paper discusses both of these formulations at a high level.
NASA Astrophysics Data System (ADS)
Jaithwa, Ishan
Deployment of smart grid technologies is accelerating. Smart grids enable bidirectional flows of energy and energy-related communications. The future electricity grid will look very different from today's power system. Large variable renewable energy sources will provide a greater portion of electricity, small DERs and energy storage systems will become more common, and utilities will operate many different kinds of energy-efficiency programs. All of these changes will add complexity to the grid and require operators to respond to fast dynamic changes to maintain system stability and security. This thesis investigates advanced control technology for grid integration of renewable energy sources and STATCOM systems by verifying it in real-time hardware experiments on two different systems: dSPACE and OPAL-RT. Three controllers (conventional vector control, direct vector control, and an intelligent neural network control) were first simulated in Matlab to check the stability and safety of the system and were then implemented on real-time hardware using the dSPACE and OPAL-RT systems. The thesis then shows how the neural networks trained with dynamic-programming (DP) methods outperform the other controllers: an optimal control strategy is developed to ensure effective power delivery and to improve system stability. Through real-time hardware implementation it is shown that the neural vector control approach produces the fastest response time, low overshoot, and the best overall performance compared to the conventional standard vector control method and the DCC vector control technique. Finally, the entrepreneurial approach taken to drive the technologies from the lab to market via ORANGE ELECTRIC is discussed in brief.
NASA Astrophysics Data System (ADS)
Benkrid, K.; Belkacemi, S.; Sukhsawas, S.
2005-06-01
This paper proposes an integrated framework for the high-level design of high-performance signal processing algorithms' implementations on FPGAs. The framework emerged from a constant need to rapidly implement increasingly complicated algorithms on FPGAs while maintaining the high performance needed in many real-time digital signal processing applications. This is particularly important for application developers who often rely on iterative and interactive development methodologies. The central idea behind the proposed framework is to dynamically integrate high-performance structural hardware description languages with higher-level hardware languages in order to help satisfy the dual requirement of high-level design and high-performance implementation. The paper illustrates this by integrating two environments: Celoxica's Handel-C language, and HIDE, a structural hardware environment developed at the Queen's University of Belfast. On the one hand, Handel-C has been proven to be very useful in the rapid design and prototyping of FPGA circuits, especially control-intensive ones. On the other hand, HIDE has been used extensively, and successfully, in the generation of highly optimised parameterisable FPGA cores. In this paper, this is illustrated in the construction of a scalable and fully parameterisable core for image algebra's five core neighbourhood operations, where fully floorplanned efficient FPGA configurations, in the form of EDIF netlists, are generated automatically for instances of the core. In the proposed combined framework, highly optimised data paths are invoked dynamically from within Handel-C, and are synthesized using HIDE. Although the idea might seem simple prima facie, it could have serious implications on the design of future generations of hardware description languages.
Hardware-efficient implementation of digital FIR filter using fast first-order moment algorithm
NASA Astrophysics Data System (ADS)
Cao, Li; Liu, Jianguo; Xiong, Jun; Zhang, Jing
2018-03-01
As the digital finite impulse response (FIR) filter can be transformed into the shift-add form of multiple small-sized first-order moments, based on the existing fast first-order moment algorithm, this paper presents a novel multiplier-less structure to calculate any number of sequential filtering results in parallel. The theoretical analysis on its hardware and time-complexities reveals that by appropriately setting the degree of parallelism and the decomposition factor of a fixed word width, the proposed structure may achieve better area-time efficiency than the existing two-dimensional (2-D) memoryless-based filter. To evaluate the performance concretely, the proposed designs for different taps along with the existing 2-D memoryless-based filters, are synthesized by Synopsys Design Compiler with 0.18-μm SMIC library. The comparisons show that the proposed design has less area-time complexity and power consumption when the number of filter taps is larger than 48.
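The arithmetic identity behind the first-order-moment formulation can be demonstrated in a few lines of Python: delayed input samples are bucketed by their quantized value, and each output becomes a first-order moment that can be evaluated with additions alone (suffix accumulation), with no general multipliers. The word widths and tap values below are illustrative, not the paper's design parameters:

```python
# Sketch: FIR filtering recast as a first-order moment computation.
import numpy as np

B = 4                                    # input word width (values 0..15)
rng = np.random.default_rng(0)
h = rng.integers(-8, 8, size=16)         # integer filter taps
x = rng.integers(0, 2**B, size=256)      # quantized input samples

def fir_first_order_moment(x, h, B):
    y = np.zeros(len(x) - len(h) + 1, dtype=np.int64)
    for n in range(len(y)):
        window = x[n:n + len(h)][::-1]   # x[n-k] alignment
        s = np.zeros(2**B, dtype=np.int64)
        for k, v in enumerate(window):
            s[v] += h[k]                 # bucket coefficients by input value
        # First-order moment sum_v v*s[v] via suffix sums: additions only.
        acc = np.cumsum(s[::-1])[::-1]   # acc[j] = sum over v >= j of s[v]
        y[n] = acc[1:].sum()
    return y

ref = np.convolve(x, h, mode="valid")
print(np.array_equal(fir_first_order_moment(x, h, B), ref))  # True
```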
A rate-compatible family of protograph-based LDPC codes built by expurgation and lengthening
NASA Technical Reports Server (NTRS)
Dolinar, Sam
2005-01-01
We construct a protograph-based rate-compatible family of low-density parity-check codes that cover a very wide range of rates from 1/2 to 16/17, perform within about 0.5 dB of their capacity limits for all rates, and can be decoded conveniently and efficiently with a common hardware implementation.
NASA Astrophysics Data System (ADS)
Nohara, Mitsuo; Takeuchi, Yoshio; Takahata, Fumio
Up-link power control (UPC) is one of the essential technologies to provide efficient satellite communication systems operated at frequency bands above 10 GHz. A simple and cost-effective UPC scheme applicable to a demand assignment international business satellite communications system has been developed. This paper presents the UPC scheme, including the hardware implementation and its performance.
Targeting multiple heterogeneous hardware platforms with OpenCL
NASA Astrophysics Data System (ADS)
Fox, Paul A.; Kozacik, Stephen T.; Humphrey, John R.; Paolini, Aaron; Kuller, Aryeh; Kelmelis, Eric J.
2014-06-01
The OpenCL API allows for the abstract expression of parallel, heterogeneous computing, but hardware implementations have substantial implementation differences. The abstractions provided by the OpenCL API are often insufficiently high-level to conceal differences in hardware architecture. Additionally, implementations often do not take advantage of potential performance gains from certain features due to hardware limitations and other factors. These factors make it challenging to produce code that is portable in practice, resulting in much OpenCL code being duplicated for each hardware platform being targeted. This duplication of effort offsets the principal advantage of OpenCL: portability. The use of certain coding practices can mitigate this problem, allowing a common code base to be adapted to perform well across a wide range of hardware platforms. To this end, we explore some general practices for producing performant code that are effective across platforms. Additionally, we explore some ways of modularizing code to enable optional optimizations that take advantage of hardware-specific characteristics. The minimum requirement for portability implies avoiding the use of OpenCL features that are optional, not widely implemented, poorly implemented, or missing in major implementations. Exposing multiple levels of parallelism allows hardware to take advantage of the types of parallelism it supports, from the task level down to explicit vector operations. Static optimizations and branch elimination in device code help the platform compiler to effectively optimize programs. Modularization of some code is important to allow operations to be chosen for performance on target hardware. Optional subroutines exploiting explicit memory locality allow for different memory hierarchies to be exploited for maximum performance. The C preprocessor and JIT compilation using the OpenCL runtime can be used to enable some of these techniques, as well as to factor in hardware-specific optimizations as necessary.
An improved real time superresolution FPGA system
NASA Astrophysics Data System (ADS)
Lakshmi Narasimha, Pramod; Mudigoudar, Basavaraj; Yue, Zhanfeng; Topiwala, Pankaj
2009-05-01
In numerous computer vision applications, enhancing the quality and resolution of captured video can be critical. Acquired video is often grainy and low quality due to motion, transmission bottlenecks, etc. Postprocessing can enhance it. Superresolution greatly decreases camera jitter to deliver a smooth, stabilized, high quality video. In this paper, we extend previous work on a real-time superresolution application implemented in ASIC/FPGA hardware. A gradient based technique is used to register the frames at the sub-pixel level. Once we get the high resolution grid, we use an improved regularization technique in which the image is iteratively modified by applying back-projection to get a sharp and undistorted image. The algorithm was first tested in software and migrated to hardware, to achieve 320x240 -> 1280x960, about 30 fps, a stunning superresolution by 16X in total pixels. Various input parameters, such as size of input image, enlarging factor and the number of nearest neighbors, can be tuned conveniently by the user. We use a maximum word size of 32 bits to implement the algorithm in Matlab Simulink as well as in FPGA hardware, which gives us a fine balance between the number of bits and performance. The proposed system is robust and highly efficient. We have shown the performance improvement of the hardware superresolution over the software version (C code).
Parameterized hardware description as object oriented hardware model implementation
NASA Astrophysics Data System (ADS)
Drabik, Pawel K.
2010-09-01
The paper introduces a novel model for the design, visualization and management of complex, highly adaptive hardware systems. The model establishes a component-oriented environment for both hardware modules and software applications, and builds on parameterized hardware description research. Establishment of a stable link between hardware and software, the purpose of the designed and realized work, is presented. A novel programming framework model for the environment, named Graphic-Functional-Components, is presented. The purpose of the paper is to present object-oriented hardware modeling with the mentioned features. A possible model implementation in FPGA chips and its management by object-oriented software in Java is described.
NASA Astrophysics Data System (ADS)
Suarez, Hernan; Zhang, Yan R.
2015-05-01
New radar applications need to perform complex algorithms and process large quantities of data to generate useful information for the users. This situation has motivated the search for better processing solutions that include low-power high-performance processors, efficient algorithms, and high-speed interfaces. In this work, hardware implementations of adaptive pulse compression for real-time transceiver optimization are presented; they are based on a System-on-Chip architecture for Xilinx devices. This study also evaluates the performance of dedicated coprocessors as hardware accelerator units to speed up and improve the computation of computing-intensive tasks such as matrix multiplication and matrix inversion, which are essential units for solving the covariance matrix. The tradeoffs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through high-performance AXI buses, to perform floating-point operations, control the processing blocks, and communicate with an external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operations using a Ku-band testbed together with a low-cost channel emulator for different types of waveforms.
Software-implemented fault insertion: An FTMP example
NASA Technical Reports Server (NTRS)
Czeck, Edward W.; Siewiorek, Daniel P.; Segall, Zary Z.
1987-01-01
This report presents a model for fault insertion through software; describes its implementation on a fault-tolerant computer, FTMP; presents a summary of fault detection, identification, and reconfiguration data collected with software-implemented fault insertion; and compares the results to hardware fault insertion data. Experimental results show detection time to be a function of time of insertion and system workload. For the fault detection time, there is no correlation between software-inserted faults and hardware-inserted faults; this is because hardware-inserted faults must manifest as errors before detection, whereas software-inserted faults immediately exercise the error detection mechanisms. In summary, the software-implemented fault insertion is able to be used as an evaluation technique for the fault-handling capabilities of a system in fault detection, identification and recovery. Although the software-inserted faults do not map directly to hardware-inserted faults, experiments show software-implemented fault insertion is capable of emulating hardware fault insertion, with greater ease and automation.
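The basic mechanism, corrupting a register value from software at a chosen instant, can be sketched as follows in Python; the workload and the golden-run comparison are toy stand-ins for FTMP's detection machinery:

```python
# Sketch: software-implemented fault insertion by flipping one bit of a
# "register" value mid-computation, then comparing against a golden run.
import random

def insert_fault(word, width=32):
    """Flip one random bit, emulating a transient single-bit fault."""
    return word ^ (1 << random.randrange(width))

def workload(n, fault_at=None):
    acc = 0
    for i in range(n):
        acc = (acc + i) & 0xFFFFFFFF          # 32-bit accumulator
        if i == fault_at:
            acc = insert_fault(acc)           # fault manifests immediately
    return acc

random.seed(7)
golden = workload(10_000)
faulty = workload(10_000, fault_at=5_000)
print("error detected" if faulty != golden else "fault masked")
```

Note that, as the report observes for software-inserted faults generally, the corruption above is already an error the moment it is inserted, whereas a hardware fault must first manifest as an error before any detector can see it.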
Khiarak, Mehdi Noormohammadi; Martianova, Ekaterina; Bories, Cyril; Martel, Sylvain; Proulx, Christophe D; De Koninck, Yves; Gosselin, Benoit
2018-06-01
Fluorescence biophotometry measurements require wide dynamic range (DR) and high-sensitivity laboratory apparatus. Indeed, it is often very challenging to accurately resolve the small fluorescence variations in the presence of noise and high-background tissue autofluorescence. There is a great need for smaller detectors combining high linearity, high sensitivity, and high energy efficiency. This paper presents a new biophotometry sensor merging two individual building blocks, namely a low-noise sensing front-end and a continuous-time sigma-delta modulator (CTSDM), into a single module for enabling high-sensitivity and high-energy-efficiency photo-sensing. In particular, a differential CMOS photodetector associated with a differential capacitive transimpedance amplifier-based sensing front-end is merged with an incremental 1-bit CTSDM to achieve a large DR, low hardware complexity, and high energy efficiency. The sensor leverages a hardware sharing strategy to simplify the implementation and reduce power consumption. The proposed CMOS biosensor is integrated within a miniature wireless head-mountable prototype for enabling biophotometry with a single implantable fiber in the brain of live mice. The proposed biophotometry sensor is implemented in a 0.18-μm CMOS technology, consuming from a 1.8-V supply voltage, while achieving a peak dynamic range of over a 50- input bandwidth, a sensitivity of 24 mV/nW, and a minimum detectable current of 2.46- at a 20- sampling rate.
Near Theoretical Gigabit Link Efficiency for Distributed Data Acquisition Systems
Abu-Nimeh, Faisal T.; Choong, Woon-Seng
2017-01-01
Link efficiency, data integrity, and continuity for high-throughput and real-time systems are crucial. Most of these applications require specialized hardware and operating systems as well as extensive tuning in order to achieve high efficiency. Here, we present an implementation of gigabit Ethernet data streaming which can achieve 99.26% link efficiency while maintaining no packet losses. The design and implementation are built on OpenPET, an open-source data acquisition platform for nuclear medical imaging, where (a) a crate hosting multiple OpenPET detector boards uses a User Datagram Protocol over Internet Protocol (UDP/IP) Ethernet soft-core, that is capable of understanding PAUSE frames, to stream data out to a computer workstation; (b) the receiving computer uses Netmap to allow the processing software (i.e., user space), which is written in Python, to directly receive and manage the network card's ring buffers, bypassing the operating system kernel's networking stack; and (c) a multi-threaded application using synchronized queues is implemented in the processing software (Python) to free up the ring buffers as quickly as possible while preserving data integrity and flow continuity. PMID:28630948
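The receive-side pattern described in (b) and (c) can be sketched in Python as follows; an ordinary UDP socket stands in for the Netmap ring-buffer access used in the paper, and the port and queue depth are illustrative:

```python
# Sketch: a reader thread drains a UDP socket into a synchronized queue so
# receive buffers are freed quickly, while a worker thread processes data.
import queue
import socket
import threading
import time

PORT = 9000                               # illustrative port
pkts: "queue.Queue[bytes]" = queue.Queue(maxsize=4096)

def reader():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", PORT))
    while True:
        data, _addr = sock.recvfrom(2048)  # one datagram per call
        pkts.put(data)                     # blocks if the worker lags behind

def worker():
    received = 0
    while True:
        data = pkts.get()                  # real code would parse event words
        received += len(data)
        pkts.task_done()

threading.Thread(target=reader, daemon=True).start()
threading.Thread(target=worker, daemon=True).start()
time.sleep(5)                              # let the pipeline run briefly
```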
Floating-point scaling technique for sources separation automatic gain control
NASA Astrophysics Data System (ADS)
Fermas, A.; Belouchrani, A.; Ait-Mohamed, O.
2012-07-01
Based on the floating-point representation and taking advantage of the scaling-factor indeterminacy in blind source separation (BSS) processing, we propose a scaling technique applied to the separation matrix to avoid saturation or weakness in the recovered source signals. This technique performs automatic gain control in an on-line BSS environment. We demonstrate its effectiveness using the implementation of a division-free BSS algorithm with two inputs and two outputs. The proposed technique is computationally cheaper and more efficient for a hardware implementation than Euclidean normalisation.
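A minimal Python sketch of the idea: each row of the separation matrix is rescaled by a power of two chosen from its peak magnitude, so only floating-point exponents are touched, which is what makes the operation hardware-cheap. The target range is an illustrative choice:

```python
# Sketch: power-of-two automatic gain control on a separation matrix W,
# exploiting the scaling indeterminacy of BSS.
import math
import numpy as np

def agc_pow2(W):
    W = W.astype(np.float64).copy()
    for i, row in enumerate(W):
        peak = np.abs(row).max()
        if peak == 0.0:
            continue
        _, exp = math.frexp(peak)     # peak = mant * 2**exp, mant in [0.5, 1)
        W[i] = row * 2.0 ** (-exp)    # scale the row's peak into [0.5, 1)
    return W

W = np.array([[120.0, -64.0], [0.001, 0.0004]])
print(agc_pow2(W))                    # each row's peak now lies in [0.5, 1)
```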
Embedded Streaming Deep Neural Networks Accelerator With Applications.
Dundar, Aysegul; Jin, Jonghoon; Martini, Berin; Culurciello, Eugenio
2017-07-01
Deep convolutional neural networks (DCNNs) have become a very powerful tool in visual perception. DCNNs have applications in autonomous robots, security systems, mobile phones, and automobiles, where high throughput of the feedforward evaluation phase and power efficiency are important. Because of this increased usage, many field-programmable gate array (FPGA)-based accelerators have been proposed. In this paper, we present an optimized streaming method for DCNNs' hardware accelerator on an embedded platform. The streaming method acts as a compiler, transforming a high-level representation of DCNNs into operation codes to execute applications in a hardware accelerator. The proposed method utilizes the maximum computational resources available based on a novel scheduled routing topology that combines data reuse and data concatenation. It is tested with a hardware accelerator implemented on the Xilinx Kintex-7 XC7K325T FPGA. The system fully explores weight-level and node-level parallelizations of DCNNs and achieves a peak performance of 247 G-ops while consuming less than 4 W of power. We test our system with applications on object classification and object detection in real-world scenarios. Our results indicate high performance efficiency, outperforming all other presented platforms while running these applications.
Computing Generalized Matrix Inverse on Spiking Neural Substrate.
Shukla, Rohit; Khoram, Soroosh; Jorgensen, Erik; Li, Jing; Lipasti, Mikko; Wright, Stephen
2018-01-01
Emerging neural hardware substrates, such as IBM's TrueNorth Neurosynaptic System, can provide an appealing platform for deploying numerical algorithms. For example, a recurrent Hopfield neural network can be used to find the Moore-Penrose generalized inverse of a matrix, thus enabling a broad class of linear optimizations to be solved efficiently, at low energy cost. However, deploying numerical algorithms on hardware platforms that severely limit the range and precision of representation for numeric quantities can be quite challenging. This paper discusses these challenges and proposes a rigorous mathematical framework for reasoning about range and precision on such substrates. The paper derives techniques for normalizing inputs and properly quantizing synaptic weights originating from arbitrary systems of linear equations, so that solvers for those systems can be implemented in a provably correct manner on hardware-constrained neural substrates. The analytical model is empirically validated on the IBM TrueNorth platform, and results show that the guarantees provided by the framework for range and precision hold under experimental conditions. Experiments with optical flow demonstrate the energy benefits of deploying a reduced-precision and energy-efficient generalized matrix inverse engine on the IBM TrueNorth platform, reflecting 10× to 100× improvement over FPGA and ARM core baselines.
Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor
NASA Technical Reports Server (NTRS)
Moore, J. Strother
1992-01-01
Consider a network of four processors that use the Oral Messages (Byzantine Generals) Algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original Pease et al. work, Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module and state a theorem that expresses the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.
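For orientation, the following Python sketch simulates the Oral Messages algorithm OM(1) for four processors (one commander and three lieutenants) with at most one fault; the fault behavior modeled here is an illustrative assumption, and the verified hardware of course implements far more detail:

```python
# Sketch: behavioral simulation of OM(1) with binary values.
from collections import Counter

def om1(commander_value, faulty):
    """Return each lieutenant's decision; lieutenants are 1, 2, 3."""
    lieutenants = [1, 2, 3]
    # Round 1: commander (id 0) sends its value to every lieutenant.
    # A faulty commander is modeled as two-faced: it lies to odd lieutenants.
    received = {i: (~commander_value & 1) if (0 in faulty and i % 2)
                   else commander_value for i in lieutenants}
    # Round 2: each lieutenant relays what it received to the others.
    relayed = {i: {} for i in lieutenants}
    for j in lieutenants:
        for i in lieutenants:
            if i == j:
                continue
            v = received[j]
            if j in faulty:              # a traitor relays a corrupted value
                v = ~v & 1
            relayed[i][j] = v
    # Decision: majority over the own value and the two relayed values.
    return {i: Counter([received[i], *relayed[i].values()]).most_common(1)[0][0]
            for i in lieutenants}

print(om1(1, faulty=set()))      # all lieutenants decide 1
print(om1(1, faulty={2}))        # the loyal lieutenants still decide 1
print(om1(1, faulty={0}))        # traitorous commander: lieutenants agree
```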
Improving Design Efficiency for Large-Scale Heterogeneous Circuits
NASA Astrophysics Data System (ADS)
Gregerson, Anthony
Despite increases in logic density, many Big Data applications must still be partitioned across multiple computing devices in order to meet their strict performance requirements. Among the most demanding of these applications is high-energy physics (HEP), which uses complex computing systems consisting of thousands of FPGAs and ASICs to process the sensor data created by experiments at particle accelerators such as the Large Hadron Collider (LHC). Designing such computing systems is challenging due to the scale of the systems, the exceptionally high-throughput and low-latency performance constraints that necessitate application-specific hardware implementations, the requirement that algorithms are efficiently partitioned across many devices, and the possible need to update the implemented algorithms during the lifetime of the system. In this work, we describe our research to develop flexible architectures for implementing such large-scale circuits on FPGAs. In particular, this work is motivated by (but not limited in scope to) high-energy physics algorithms for the Compact Muon Solenoid (CMS) experiment at the LHC. To make efficient use of logic resources in multi-FPGA systems, we introduce Multi-Personality Partitioning, a novel form of the graph partitioning problem, and present partitioning algorithms that can significantly improve resource utilization on heterogeneous devices while also reducing inter-chip connections. To reduce the high communication costs of Big Data applications, we also introduce Information-Aware Partitioning, a partitioning method that analyzes the data content of application-specific circuits, characterizes their entropy, and selects circuit partitions that enable efficient compression of data between chips. We employ our information-aware partitioning method to improve the performance of the hardware validation platform for evaluating new algorithms for the CMS experiment. Together, these research efforts help to improve the efficiency and decrease the cost of developing the large-scale, heterogeneous circuits needed to enable large-scale applications in high-energy physics and other important areas.
Advanced modulation technology development for earth station demodulator applications
NASA Technical Reports Server (NTRS)
Davis, R. C.; Wernlund, J. V.; Gann, J. A.; Roesch, J. F.; Wright, T.; Crowley, R. D.
1989-01-01
The purpose of this contract was to develop a high-rate (200 Mbps), bandwidth-efficient modulation format using low-cost hardware in 1990s technology. The modulation format chosen is 16-ary continuous phase frequency shift keying (CPFSK). The implementation of the modulation format uses a unique combination of a limiter/discriminator followed by an accumulator to determine the transmitted phase. An important feature of the modulation scheme is the way coding is applied to efficiently gain back the performance lost by the close spacing of the phase points.
Real-time demonstration hardware for enhanced DPCM video compression algorithm
NASA Technical Reports Server (NTRS)
Bizon, Thomas P.; Whyte, Wayne A., Jr.; Marcopoli, Vincent R.
1992-01-01
The lack of available wideband digital links as well as the complexity of implementation of bandwidth-efficient digital video CODECs (encoder/decoder) has worked to keep the cost of digital television transmission too high to compete with analog methods. Terrestrial and satellite video service providers, however, are now recognizing the potential gains that digital video compression offers and are proposing to incorporate compression systems to increase the number of available program channels. NASA is similarly recognizing the benefits of and trend toward digital video compression techniques for transmission of high quality video from space and, therefore, has developed a digital television bandwidth compression algorithm to process standard National Television Systems Committee (NTSC) composite color television signals. The algorithm is based on differential pulse code modulation (DPCM), but additionally utilizes a non-adaptive predictor, non-uniform quantizer and multilevel Huffman coder to reduce the data rate substantially below that achievable with straight DPCM. The non-adaptive predictor and multilevel Huffman coder combine to set this technique apart from other DPCM encoding algorithms. All processing is done on an intra-field basis to prevent motion degradation and minimize hardware complexity. Computer simulations have shown the algorithm will produce broadcast quality reconstructed video at an average transmission rate of 1.8 bits/pixel. Hardware implementation of the DPCM circuit, non-adaptive predictor and non-uniform quantizer has been completed, providing real-time demonstration of the image quality at full video rates. Video sampling/reconstruction circuits have also been constructed to accomplish the analog video processing necessary for the real-time demonstration. Performance results for the completed hardware compare favorably with simulation results. Hardware implementation of the multilevel Huffman encoder/decoder is currently under development, along with implementation of a buffer control algorithm to accommodate the variable data rate output of the multilevel Huffman encoder. A video CODEC of this type could be used to compress NTSC color television signals where high quality reconstruction is desirable (e.g., Space Station video transmission, transmission direct-to-the-home via direct broadcast satellite systems or cable television distribution to system headends and direct-to-the-home).
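The encoder loop structure, a fixed (non-adaptive) previous-pixel predictor feeding a non-uniform quantizer whose indices would then be Huffman-coded, can be sketched as follows in Python; the quantizer levels here are illustrative, not the hardware's actual tables:

```python
# Sketch: scalar DPCM with a fixed predictor and non-uniform quantizer.
import numpy as np

# Non-uniform reconstruction levels: fine near zero, coarse for large errors.
LEVELS = np.array([-80, -40, -18, -6, 0, 6, 18, 40, 80], dtype=np.int32)

def quantize(e):
    return int(np.abs(LEVELS - e).argmin())       # index of nearest level

def dpcm_encode_decode(line):
    pred = 128                                    # fixed initial prediction
    indices, recon = [], []
    for pixel in line:
        idx = quantize(int(pixel) - pred)         # quantize prediction error
        indices.append(idx)                       # these get Huffman-coded
        value = int(np.clip(pred + LEVELS[idx], 0, 255))
        recon.append(value)
        pred = value                              # decoder tracks encoder
    return indices, np.array(recon, dtype=np.uint8)

line = (128 + 60 * np.sin(np.linspace(0, 6, 64))).astype(np.uint8)
idx, recon = dpcm_encode_decode(line)
print("max reconstruction error:",
      np.abs(line.astype(int) - recon.astype(int)).max())
```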
Power Strategy in DC/DC Converters to Increase Efficiency of Electrical Stimulators.
Aqueveque, Pablo; Acuña, Vicente; Saavedra, Francisco; Debelle, Adrien; Lonys, Laurent; Julémont, Nicolas; Huberland, François; Godfraind, Carmen; Nonclercq, Antoine
2016-06-13
Power efficiency is critical for electrical stimulators. Battery life of wearable stimulators and wireless power transmission in implanted systems are common limiting factors. Boost DC/DC converters are typically needed to increase the supply voltage of the output stage. Traditionally, boost DC/DC converters are used with fast control to regulate the supply voltage of the output stage. However, since stimulators act as current sources, such voltage regulation is not needed. Building on this observation, this paper presents a DC/DC conversion strategy aimed at increasing power efficiency. It compares, in terms of efficiency, the traditional use of boost converters to two alternatives that could be implemented in future hardware designs.
Deep learning for medical image segmentation - using the IBM TrueNorth neurosynaptic system
NASA Astrophysics Data System (ADS)
Moran, Steven; Gaonkar, Bilwaj; Whitehead, William; Wolk, Aidan; Macyszyn, Luke; Iyer, Subramanian S.
2018-03-01
Deep convolutional neural networks have found success in semantic image segmentation tasks in computer vision and medical imaging. These algorithms are executed on conventional von Neumann processor architectures or GPUs. This is suboptimal. Neuromorphic processors that replicate the structure of the brain are better-suited to train and execute deep learning models for image segmentation by relying on massively-parallel processing. However, given that they closely emulate the human brain, on-chip hardware and digital memory limitations also constrain them. Adapting deep learning models to execute image segmentation tasks on such chips requires specialized training and validation. In this work, we demonstrate, for the first time, spinal image segmentation performed using a deep learning network implemented on the neuromorphic hardware of the IBM TrueNorth Neurosynaptic System, and validate the performance of our network by comparing it to human-generated segmentations of spinal vertebrae and disks. To achieve this on neuromorphic hardware, the training model constrains the coefficients of individual neurons to {-1,0,1} using the Energy Efficient Deep Neuromorphic (EEDN) networks training algorithm. With its 1 million neurons and 256 million synapses, the scale of the neural network implemented by the IBM TrueNorth allows us to execute the requisite mapping between segmented images and non-uniform intensity MR images >20 times faster than on a GPU-accelerated network while using <0.1 W. This speed and efficiency imply that a trained neuromorphic chip can be deployed in intra-operative environments where real-time medical image segmentation is necessary.
Applying reconfigurable hardware to the analysis of multispectral and hyperspectral imagery
NASA Astrophysics Data System (ADS)
Leeser, Miriam E.; Belanovic, Pavle; Estlick, Michael; Gokhale, Maya; Szymanski, John J.; Theiler, James P.
2002-01-01
Unsupervised clustering is a powerful technique for processing multispectral and hyperspectral images. Last year, we reported on an implementation of k-means clustering for multispectral images. Our implementation in reconfigurable hardware processed 10 channel multispectral images two orders of magnitude faster than a software implementation of the same algorithm. The advantage of using reconfigurable hardware to accelerate k-means clustering is clear; the disadvantage is that the hardware implementation worked for one specific dataset. It is a non-trivial task to change this implementation to handle a dataset with a different number of spectral channels, bits per spectral channel, or number of pixels, or to change the number of clusters. These changes require knowledge of the hardware design process and could take several days of a designer's time. Since multispectral data sets come in many shapes and sizes, being able to easily change the k-means implementation for these different data sets is important. For this reason, we have developed a parameterized implementation of the k-means algorithm. Our design is parameterized by the number of pixels in an image, the number of channels per pixel, and the number of bits per channel as well as the number of clusters. These parameters can easily be changed in a few minutes by someone not familiar with the design process. The resulting implementation is very close in performance to the original hardware implementation. It has the added advantage that the parameterized design compiles approximately three times faster than the original.
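A compact Python sketch of the parameterized computation follows; array shapes mirror the hardware parameters (pixels, channels, clusters). The Manhattan distance is shown because it avoids multipliers in hardware, but whether it matches the paper's exact metric is an assumption of this sketch.

```python
import numpy as np

def kmeans(pixels, n_clusters, n_iters=10, seed=0):
    """Cluster pixel vectors; the shapes are free parameters, mirroring how
    the hardware design is parameterized by pixels, channels, and clusters."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Manhattan distance: a common hardware-friendly choice (no multipliers)
        d = np.abs(pixels[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean(axis=0)
    return labels, centers

# e.g. a 10-channel multispectral image flattened to (num_pixels, 10)
img = np.random.default_rng(1).integers(0, 256, (5000, 10)).astype(float)
labels, centers = kmeans(img, n_clusters=8)
```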
An efficient decoding for low density parity check codes
NASA Astrophysics Data System (ADS)
Zhao, Ling; Zhang, Xiaolin; Zhu, Manjie
2009-12-01
Low density parity check (LDPC) codes are a class of forward-error-correction codes. They are among the best-known codes capable of achieving low bit error rates (BER) approaching Shannon's capacity limit. Recently, LDPC codes have been adopted by the European Digital Video Broadcasting (DVB-S2) standard, and have also been proposed for the emerging IEEE 802.16 fixed and mobile broadband wireless-access standard. The Consultative Committee for Space Data Systems (CCSDS) has also recommended using LDPC codes in deep-space and near-earth communications. It is clear that LDPC codes will be widely used in wired and wireless communication, magnetic recording, optical networking, DVB, and other fields in the near future. Efficient hardware implementation of LDPC codes is of great interest since they are being considered for a wide range of applications. This paper presents an efficient partially parallel decoder architecture suited for quasi-cyclic (QC) LDPC codes, using the belief propagation algorithm for decoding. Algorithmic transformation and architectural-level optimization are incorporated to reduce the critical path. First, the check matrix of the LDPC code is analyzed to find the relationship between the row weight and the column weight. Then, the sharing level of the check node updating units (CNU) and the variable node updating units (VNU) is determined according to this relationship. After that, the CNU and VNU are rearranged and divided into several smaller parts; with the help of some assistant logic circuits, these smaller parts can be grouped into CNUs during check node updating and into VNUs during variable node updating. These smaller parts are called node update kernel units (NKU) and the assistant logic circuits are called node update auxiliary units (NAU). With the NAUs' help, the two steps of the iteration are completed by the NKUs, which brings a great reduction in hardware resources. Meanwhile, efficient techniques have been developed to reduce the computation delay of the node processing units and to minimize hardware overhead for parallel processing. This method may be applied not only to regular LDPC codes, but also to irregular ones. Based on the proposed architectures, a (7493, 6096) irregular QC-LDPC code decoder is described using the Verilog hardware description language and implemented on an Altera StratixII EP2S130 field programmable gate array (FPGA). The implementation results show that over 20% of the logic core size can be saved compared with conventional partially parallel decoder architectures, without any performance degradation. At a decoding clock of 100 MHz, the proposed decoder achieves a maximum (source data) decoding throughput of 133 Mb/s at 18 iterations.
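For flavor, here is a Python sketch of a single check-node update in the hardware-friendly min-sum approximation of belief propagation (the paper uses belief propagation proper; min-sum is shown because a CNU then reduces to sign logic plus min-1/min-2 comparison, exactly the kind of unit being shared and regrouped above).

```python
import numpy as np

def check_node_update(msgs):
    """Min-sum check-node update: for each incoming edge, output the product
    of the signs and the minimum magnitude over all *other* edges.
    A hardware CNU computes exactly these sign / min-1 / min-2 quantities.
    Assumes nonzero input messages."""
    msgs = np.asarray(msgs, dtype=float)
    signs = np.sign(msgs)
    total_sign = np.prod(signs)
    mags = np.abs(msgs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]
    # edge carrying min1 receives min2; every other edge receives min1
    out = np.where(np.arange(len(msgs)) == order[0], min2, min1)
    return total_sign * signs * out

print(check_node_update([0.9, -2.1, 0.4, 1.5]))
```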
HDL Based FPGA Interface Library for Data Acquisition and Multipurpose Real Time Algorithms
NASA Astrophysics Data System (ADS)
Fernandes, Ana M.; Pereira, R. C.; Sousa, J.; Batista, A. J. N.; Combo, A.; Carvalho, B. B.; Correia, C. M. B. A.; Varandas, C. A. F.
2011-08-01
The inherent parallelism of the logic resources, the flexibility of its configuration and its performance at high processing frequencies make the field programmable gate array (FPGA) the most suitable device to be used both for real-time algorithm processing and data transfer in instrumentation modules. Moreover, the reconfigurability of these FPGA-based modules enables exploiting different applications on the same module. When using a reconfigurable module for various applications, the availability of a common interface library for easier implementation of the algorithms on the FPGA leads to more efficient development. The FPGA configuration is usually specified in a hardware description language (HDL) or another higher-level descriptive language. The critical paths, such as the management of internal hardware clocks, require deep knowledge of the module behavior and shall be implemented in HDL to optimize the timing constraints. The common interface library should include these critical paths, freeing the application designer from hardware complexity and allowing any of the available high-level abstraction languages to be chosen for the algorithm implementation. With this purpose a modular Verilog code was developed for the Virtex 4 FPGA of the in-house Transient Recorder and Processor (TRP) hardware module, based on the Advanced Telecommunications Computing Architecture (ATCA), with eight channels sampling at up to 400 MSamples/s (MSPS). The TRP was designed to perform real-time Pulse Height Analysis (PHA), Pulse Shape Discrimination (PSD) and Pile-Up Rejection (PUR) algorithms at a high count rate (a few Mevent/s). A brief description of this modular code is presented and examples of its use as an interface with end-user algorithms, including a PHA with PUR, are described.
A Bitslice Implementation of Anderson's Attack on A5/1
NASA Astrophysics Data System (ADS)
Bulavintsev, Vadim; Semenov, Alexander; Zaikin, Oleg; Kochemazov, Stepan
2018-03-01
The A5/1 keystream generator is part of the Global System for Mobile Communications (GSM) protocol, employed in cellular networks all over the world. Its cryptographic resistance has been extensively analyzed in dozens of papers. However, almost all corresponding methods either employ specific hardware or require an extensive preprocessing stage and significant amounts of memory. In the present study, a bitslice variant of Anderson's attack on A5/1 is implemented. It requires very little computer memory and no preprocessing. Moreover, the attack can be made even more efficient by harnessing the computing power of modern Graphics Processing Units (GPUs). As a result, using commonly available GPUs this method can quite efficiently recover the secret key using only 64 bits of keystream. To test the performance of the implementation, a volunteer computing project was launched; ten instances of A5/1 cryptanalysis were successfully solved in this project within a single week.
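The essence of bitslicing is packing one bit of many independent attack instances into each machine word, so one logical operation advances all instances at once. A minimal Python illustration for one of A5/1's registers (R1, taps at bits 13, 16, 17, 18); the majority-clocking rule of the full cipher is deliberately omitted here, and the packed test state is arbitrary.

```python
MASK = (1 << 64) - 1  # one 64-bit word = 64 independent attack instances

def step_r1(r1):
    """Advance 64 bitsliced copies of A5/1's 19-bit register R1 by one clock.

    r1 is a list of 19 ints; r1[i] packs bit position i of all 64 instances,
    so a single XOR computes the feedback for every instance in parallel."""
    feedback = (r1[13] ^ r1[16] ^ r1[17] ^ r1[18]) & MASK
    return [feedback] + r1[:-1]  # shift: bit i moves to position i + 1

r1 = [0x0123456789ABCDEF] * 19  # arbitrary packed test states
r1 = step_r1(r1)
```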
PRESAGE: PRivacy-preserving gEnetic testing via SoftwAre Guard Extension.
Chen, Feng; Wang, Chenghong; Dai, Wenrui; Jiang, Xiaoqian; Mohammed, Noman; Al Aziz, Md Momin; Sadat, Md Nazmus; Sahinalp, Cenk; Lauter, Kristin; Wang, Shuang
2017-07-26
Advances in DNA sequencing technologies have prompted a wide range of genomic applications to improve healthcare and facilitate biomedical research. However, privacy and security concerns have emerged as a challenge for utilizing cloud computing to handle sensitive genomic data. We present one of the first implementations of a Software Guard Extension (SGX) based securely outsourced genetic testing framework, which leverages multiple cryptographic protocols and a minimal perfect hash scheme to enable efficient and secure data storage and computation outsourcing. We compared the performance of the proposed PRESAGE framework with the state-of-the-art homomorphic encryption scheme, as well as a plaintext implementation. The experimental results demonstrated a significant performance improvement over the homomorphic encryption methods and a small computational overhead in comparison to the plaintext implementation. The proposed PRESAGE framework provides an alternative solution for secure and efficient genomic data outsourcing in an untrusted cloud by using a hybrid framework that combines secure hardware and multiple cryptographic protocols.
On optimal infinite impulse response edge detection filters
NASA Technical Reports Server (NTRS)
Sarkar, Sudeep; Boyer, Kim L.
1991-01-01
The authors outline the design of an optimal, computationally efficient, infinite impulse response edge detection filter. The optimal filter is computed based on Canny's high signal to noise ratio, good localization criteria, and a criterion on the spurious response of the filter to noise. An expression for the width of the filter, which is appropriate for infinite-length filters, is incorporated directly in the expression for spurious responses. The three criteria are maximized using the variational method and nonlinear constrained optimization. The optimal filter parameters are tabulated for various values of the filter performance criteria. A complete methodology for implementing the optimal filter using approximating recursive digital filtering is presented. The approximating recursive digital filter is separable into two linear filters operating in two orthogonal directions. The implementation is very simple and computationally efficient, has a constant time of execution for different sizes of the operator, and is readily amenable to real-time hardware implementation.
Automatic Digital Hardware Synthesis
1990-09-01
This document describes the process of translating VHDL to PALASM, a hardware synthesis language. The PALASM description is then directly implemented in a field programmable gate array (FPGA). This approach allows the engineer to use VHDL to create and validate a design, and then to implement it in a gate array. The development of software to translate VHDL to PALASM is described.
An Efficient Pipeline Wavefront Phase Recovery for the CAFADIS Camera for Extremely Large Telescopes
Magdaleno, Eduardo; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel
2010-01-01
In this paper we show a fast, specialized hardware implementation of the wavefront phase recovery algorithm using the CAFADIS camera. The CAFADIS camera is a new plenoptic sensor patented by the Universidad de La Laguna (Canary Islands, Spain): international patent PCT/ES2007/000046 (WIPO publication number WO/2007/082975). It can simultaneously measure the wavefront phase and the distance to the light source in a real-time process. The pipeline algorithm is implemented using Field Programmable Gate Arrays (FPGA). These devices present an architecture capable of handling the sensor output stream using a massively parallel approach, and they are efficient enough to resolve several Adaptive Optics (AO) problems in Extremely Large Telescopes (ELTs) in terms of processing-time requirements. The FPGA implementation of the wavefront phase recovery algorithm using the CAFADIS camera is based on the very fast computation of two-dimensional fast Fourier transforms (FFTs). We have therefore carried out a comparison between our novel FPGA 2D-FFT and other implementations. PMID:22315523
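The following Python fragment sketches one standard FFT-based wavefront reconstruction, a periodic Hudgin-type least-squares solver from measured x/y slopes. It illustrates why a fast 2D FFT dominates such a pipeline, but it is a generic textbook method and not necessarily the patented CAFADIS algorithm.

```python
import numpy as np

def phase_from_slopes(gx, gy):
    """Least-squares wavefront phase from slope maps via 2D FFTs.
    Assumes square n x n grids and periodic forward-difference geometry."""
    n = gx.shape[0]
    u = np.fft.fftfreq(n)                        # frequency index / n
    dx = np.exp(2j * np.pi * u)[None, :] - 1     # forward-difference operator, x
    dy = np.exp(2j * np.pi * u)[:, None] - 1     # forward-difference operator, y
    denom = np.abs(dx) ** 2 + np.abs(dy) ** 2
    denom[0, 0] = 1.0                            # piston mode is unobservable
    num = np.conj(dx) * np.fft.fft2(gx) + np.conj(dy) * np.fft.fft2(gy)
    phi_hat = num / denom
    phi_hat[0, 0] = 0.0                          # pin the mean phase to zero
    return np.real(np.fft.ifft2(phi_hat))
```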
SVM classifier on chip for melanoma detection.
Afifi, Shereen; GholamHosseini, Hamid; Sinha, Roopak
2017-07-01
Support Vector Machine (SVM) is a common classifier used for efficient classification with high accuracy. SVM shows high accuracy for classifying melanoma (skin cancer) clinical images within computer-aided diagnosis systems used by skin cancer specialists to detect melanoma early and save lives. We aim to develop a medical low-cost handheld device that runs a real-time embedded SVM-based diagnosis system for use in primary care for early detection of melanoma. In this paper, an optimized SVM classifier is implemented on a recent FPGA platform using the latest design methodology, to be embedded into the proposed device for realizing online efficient melanoma detection on a single system on chip/device. The hardware implementation results demonstrate a high classification accuracy of 97.9% and a significant acceleration factor of 26 over an equivalent software implementation on an embedded processor, with 34% resource utilization and 2 watts of power consumption. Consequently, the implemented system meets the crucial embedded-system constraints of high performance and low cost, resource utilization and power consumption, while achieving high classification accuracy.
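As an illustration of the kernel of work being accelerated, here is a minimal SVM decision function in Python. The RBF kernel, the gamma value, and all parameter names are assumptions of this sketch; the paper's exact classifier configuration is not restated in the abstract.

```python
import numpy as np

def svm_decide(x, support_vecs, alphas, labels, bias, gamma=0.5):
    """RBF-kernel SVM decision function: the multiply-accumulate structure
    that an FPGA pipeline parallelizes across support vectors."""
    k = np.exp(-gamma * np.sum((support_vecs - x) ** 2, axis=1))
    score = np.dot(alphas * labels, k) + bias
    return 1 if score >= 0 else -1
```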
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
DOE Office of Scientific and Technical Information (OSTI.GOV)
Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scales to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, and neighboring inner loops may exhibit different concurrency patterns (e.g., Reduction vs. Forall) yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer-loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Ultimately, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique being integrated into future compilers or optimization frameworks for autotuning.
Reducing the computational footprint for real-time BCPNN learning
Vogginger, Bernhard; Schüffny, René; Lansner, Anders; Cederström, Love; Partzsch, Johannes; Höppner, Sebastian
2015-01-01
The implementation of synaptic plasticity in neural simulation or neuromorphic hardware is usually very resource-intensive, often requiring a compromise between efficiency and flexibility. A versatile, but computationally expensive plasticity mechanism is provided by the Bayesian Confidence Propagation Neural Network (BCPNN) paradigm. Building upon Bayesian statistics, and having clear links to biological plasticity processes, the BCPNN learning rule has been applied in many fields, ranging from data classification, associative memory, reward-based learning, and probabilistic inference to cortical attractor memory networks. In the spike-based version of this learning rule the pre-, postsynaptic and coincident activity is traced in three low-pass-filtering stages, requiring a total of eight state variables, whose dynamics are typically simulated with the fixed-step-size Euler method. We derive analytic solutions allowing an efficient event-driven implementation of this learning rule. Further speedup is achieved first by rewriting the model, which halves the number of basic arithmetic operations per update, and second by using look-up tables for the frequently calculated exponential decay. Ultimately, in a typical use case, the simulation using our approach is more than one order of magnitude faster than with the fixed-step-size Euler method. Aiming for a small memory footprint per BCPNN synapse, we also evaluate the use of fixed-point numbers for the state variables, and assess the number of bits required to achieve the same or better accuracy than with the conventional explicit Euler method. All of this will allow a real-time simulation of a reduced cortex model based on BCPNN in high performance computing. More importantly, with the analytic solution at hand and due to the reduced memory bandwidth, the learning rule can be efficiently implemented in dedicated or existing digital neuromorphic hardware. PMID:25657618
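The heart of the speedup is replacing many small Euler steps with one analytic jump per event. A schematic Python version for a single exponential low-pass stage (time constants and values are arbitrary):

```python
import numpy as np

def decay_traces(z, t_last, t_now, tau):
    """Jump all low-pass traces from the last event to the current one in a
    single analytic step, instead of many fixed-step Euler iterations."""
    return z * np.exp(-(t_now - t_last) / tau)

# Euler with dt = 1 ms needs 500 updates to cover 0.5 s; the analytic form needs one.
z = decay_traces(np.array([1.0, 0.3]), t_last=0.0, t_now=0.5, tau=0.05)
```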
Specialized Computer Systems for Environment Visualization
NASA Astrophysics Data System (ADS)
Al-Oraiqat, Anas M.; Bashkov, Evgeniy A.; Zori, Sergii A.
2018-06-01
The need for real-time image generation of landscapes arises in various fields as part of tasks solved by virtual and augmented reality systems, as well as geographic information systems. Such systems provide opportunities for collecting, storing, analyzing and graphically visualizing geographic data. Algorithmic and hardware-software tools for increasing the realism and efficiency of environment visualization in 3D visualization systems are proposed. This paper discusses a modified path tracing algorithm with a two-level hierarchy of bounding volumes that finds intersections with Axis-Aligned Bounding Boxes. The proposed algorithm eliminates branching and hence is more suitable for implementation on multi-threaded CPUs and GPUs. A modified ROAM algorithm is used to solve the problem of qualitative visualization of reliefs and landscapes. The algorithm is implemented on parallel systems: clusters and Compute Unified Device Architecture (CUDA) networks. Results show that the implementation on MPI clusters is more efficient than on Graphics Processing Units/Graphics Processing Clusters and allows real-time synthesis. The organization and algorithms of a parallel GPU system for 3D pseudo-stereo image/video synthesis are proposed. After analyzing the feasibility of realizing each stage on a parallel GPU architecture, 3D pseudo-stereo synthesis is performed. An experimental prototype of a specialized hardware-software system for 3D pseudo-stereo imaging and video was developed on the CPU/GPU. The experimental results show that the proposed adaptation of 3D pseudo-stereo imaging to the architecture of GPU systems is efficient. It also accelerates the computational procedures of 3D pseudo-stereo synthesis for the anaglyph and anamorphic formats of the 3D stereo frame without performing optimization procedures. The acceleration is on average 11 and 54 times for the test GPUs.
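A minimal, branch-poor slab test for ray/AABB intersection in Python, of the kind such a path tracer relies on; the vector layout and the precomputed reciprocal direction are conventions of this sketch.

```python
import numpy as np

def ray_aabb(origin, inv_dir, box_min, box_max):
    """Slab test with precomputed 1/direction: straight-line min/max code
    with no data-dependent branches, which suits wide CPU/GPU threads.
    IEEE infinities from zero direction components are handled naturally."""
    t1 = (box_min - origin) * inv_dir
    t2 = (box_max - origin) * inv_dir
    t_near = np.max(np.minimum(t1, t2))
    t_far = np.min(np.maximum(t1, t2))
    return t_far >= max(t_near, 0.0)

hit = ray_aabb(np.zeros(3), 1.0 / np.array([1.0, 1.0, 1.0]),
               np.array([2.0, 2.0, 2.0]), np.array([3.0, 3.0, 3.0]))  # True
```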
Farabet, Clément; Paz, Rafael; Pérez-Carrasco, Jose; Zamarreño-Ramos, Carlos; Linares-Barranco, Alejandro; LeCun, Yann; Culurciello, Eugenio; Serrano-Gotarredona, Teresa; Linares-Barranco, Bernabe
2012-01-01
Most scene segmentation and categorization architectures for the extraction of features in images and patches make exhaustive use of 2D convolution operations for template matching, template search, and denoising. Convolutional Neural Networks (ConvNets) are one example of such architectures that can implement general-purpose bio-inspired vision systems. In standard digital computers 2D convolutions are usually expensive in terms of resource consumption and impose severe limitations for efficient real-time applications. Nevertheless, neuro-cortex inspired solutions, like dedicated Frame-Based or Frame-Free Spiking ConvNet Convolution Processors, are advancing real-time visual processing. These two approaches share the neural inspiration, but each of them solves the problem in different ways. Frame-Based ConvNets process video information frame by frame in a very robust and fast way that requires using and sharing the available hardware resources (such as multipliers and adders). Hardware resources are fixed and time-multiplexed by fetching data in and out. Thus memory bandwidth and size are important for good performance. On the other hand, spike-based convolution processors are a frame-free alternative that is able to perform convolution of a spike-based source of visual information with very low latency, which makes them ideal for very high-speed applications. However, hardware resources need to be available all the time and cannot be time-multiplexed. Thus, hardware should be modular, reconfigurable, and expandable. Hardware implementations in both VLSI custom integrated circuits (digital and analog) and FPGAs have already been used to demonstrate the performance of these systems. In this paper we present a comparison study of these two neuro-inspired solutions. A brief description of both systems is presented, along with discussions of their differences, pros and cons. PMID:22518097
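For reference, the operation both processor families realize is plain 2D convolution; a direct Python version makes the multiply-accumulate structure explicit (valid-padding only, a simplification of this sketch).

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Direct 2D convolution (valid region): the multiply-accumulate pattern
    that frame-based processors time-multiplex and spike-based processors
    keep permanently instantiated."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out
```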
Low-power cryptographic coprocessor for autonomous wireless sensor networks
NASA Astrophysics Data System (ADS)
Olszyna, Jakub; Winiecki, Wiesław
2013-10-01
The concept of autonomous wireless sensor networks involves energy harvesting, as well as effective management of system resources. Public-key cryptography (PKC) offers the advantage of elegant key agreement schemes with which a secret key can be securely established over insecure channels. In addition to solving the key management problem, the other major application of PKC is digital signatures, with which non-repudiation of message exchanges can be achieved. The motivation for studying low-power and area-efficient modular arithmetic algorithms comes from enabling public-key security for low-power devices that can perform under constrained environments like autonomous wireless sensor networks. This paper presents a cryptographic coprocessor tailored to the constraints of autonomous wireless sensor networks. Such a hardware circuit is aimed at supporting the implementation of different public-key cryptosystems based on modular arithmetic in GF(p) and GF(2m). Key components of the coprocessor are described as GEZEL models and can be easily transformed to VHDL and implemented in hardware.
Neighbour lists for smoothed particle hydrodynamics on GPUs
NASA Astrophysics Data System (ADS)
Winkler, Daniel; Rezavand, Massoud; Rauch, Wolfgang
2018-04-01
The efficient iteration of neighbouring particles is a performance critical aspect of any high performance smoothed particle hydrodynamics (SPH) solver. SPH solvers that implement a constant smoothing length generally divide the simulation domain into a uniform grid to reduce the computational complexity of the neighbour search. Based on this method, particle neighbours are either stored per grid cell or for each individual particle, denoted as Verlet list. While the latter approach has significantly higher memory requirements, it has the potential for a significant computational speedup. A theoretical comparison is performed to estimate the potential improvements of the method based on unknown hardware dependent factors. Subsequently, the computational performance of both approaches is empirically evaluated on graphics processing units. It is shown that the speedup differs significantly for different hardware, dimensionality and floating point precision. The Verlet list algorithm is implemented as an alternative to the cell linked list approach in the open-source SPH solver DualSPHysics and provided as a standalone software package.
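A compact Python sketch of the bookkeeping the paper compares: the uniform grid (cell linked list) below yields neighbour candidates from the 27 surrounding cells, and caching each particle's verified neighbours per time step is what turns it into the Verlet-list variant. Names and the 3-D layout are conventions of this sketch.

```python
import numpy as np
from collections import defaultdict

def build_cell_list(positions, h):
    """Hash each particle into a uniform grid with cell size = smoothing
    length h, so true neighbours lie in the 27 surrounding cells."""
    cells = defaultdict(list)
    for i, p in enumerate(positions):
        cells[tuple((p // h).astype(int))].append(i)
    return cells

def neighbours(i, positions, cells, h):
    """Collect particles within distance h of particle i; caching this
    list per particle gives the memory-hungry but faster Verlet list."""
    ci = tuple((positions[i] // h).astype(int))
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for j in cells.get((ci[0] + dx, ci[1] + dy, ci[2] + dz), ()):
                    if j != i and np.sum((positions[i] - positions[j]) ** 2) < h * h:
                        out.append(j)
    return out
```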
Modular multiplication in GF(p) for public-key cryptography
NASA Astrophysics Data System (ADS)
Olszyna, Jakub
Modular multiplication forms the basis of modular exponentiation, which is the core operation of the RSA cryptosystem. It is also present in many other cryptographic algorithms, including those based on ECC and HECC. Hence, an efficient implementation of PKC relies on an efficient implementation of modular multiplication. The paper presents a survey of the most common algorithms for modular multiplication along with hardware architectures especially suitable for cryptographic applications in energy-constrained environments. The motivation for studying low-power and area-efficient modular multiplication algorithms comes from enabling public-key security for ultra-low-power devices that must perform under constrained environments like wireless sensor networks. Serial architectures for GF(p) are analyzed and presented. Finally, the proposed architectures are verified and compared according to the amount of power dissipated throughout the operation.
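One classic serial GF(p) architecture of the kind surveyed is MSB-first interleaved modular multiplication: one shift-add and at most two conditional subtractions per operand bit, so the datapath is a single adder/subtractor. A bit-level Python model (parameter names are ours):

```python
def interleaved_modmul(a, b, p, k):
    """MSB-first interleaved modular multiplication, assuming a, b < p and
    k = bit length of the modulus p. After the shift r < 2p and after the
    conditional add r < 3p, so two conditional subtractions suffice."""
    r = 0
    for i in range(k - 1, -1, -1):
        r <<= 1                 # shift (multiply partial result by 2)
        if (a >> i) & 1:
            r += b              # conditionally add the multiplicand
        if r >= p:
            r -= p              # first reduction
        if r >= p:
            r -= p              # second reduction
    return r

assert interleaved_modmul(7, 9, 11, 4) == (7 * 9) % 11
```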
Bio-inspired online variable recruitment control of fluidic artificial muscles
NASA Astrophysics Data System (ADS)
Jenkins, Tyler E.; Chapman, Edward M.; Bryant, Matthew
2016-12-01
This paper details the creation of a hybrid variable recruitment control scheme for fluidic artificial muscle (FAM) actuators with an emphasis on maximizing system efficiency and switching control performance. Variable recruitment is the process of altering a system’s active number of actuators, allowing operation in distinct force regimes. Previously, FAM variable recruitment was only quantified with offline, manual valve switching; this study addresses the creation and characterization of novel, on-line FAM switching control algorithms. The bio-inspired algorithms are implemented in conjunction with a PID and model-based controller, and applied to a simulated plant model. Variable recruitment transition effects and chatter rejection are explored via a sensitivity analysis, allowing a system designer to weigh tradeoffs in actuator modeling, algorithm choice, and necessary hardware. Variable recruitment is further developed through simulation of a robotic arm tracking a variety of spline position inputs, requiring several levels of actuator recruitment. Switching controller performance is quantified and compared with baseline systems lacking variable recruitment. The work extends current variable recruitment knowledge by creating novel online variable recruitment control schemes, and exploring how online actuator recruitment affects system efficiency and control performance. Key topics associated with implementing a variable recruitment scheme, including the effects of modeling inaccuracies, hardware considerations, and switching transition concerns are also addressed.
Potamitis, Ilyas; Rigakis, Iraklis; Fysarakis, Konstantinos
2014-01-01
Certain insects affect cultivations in a detrimental way. A notable case is the olive fruit fly (Bactrocera oleae (Rossi)), which in Europe alone causes billions of euros in crop loss per year. Pests can be controlled with aerial and ground bait pesticide sprays, the efficiency of which depends on knowing the time and location of insect infestations as early as possible. The inspection of traps is currently carried out manually. Automatic monitoring traps can enhance efficient monitoring of flying pests by identifying and counting targeted pests as they enter the trap. This work deals with the hardware setup of an insect trap with an embedded optoelectronic sensor that automatically records insects as they fly into the trap. The sensor responsible for detecting the insect is an array of phototransistors receiving light from an infrared LED. The wing-beat recording is based on the interruption of the emitted light due to partial occlusion by the insect's wings as it flies into the trap. We show that the recordings are of high quality, paving the way for automatic recognition and transmission of insect detections from the field to a smartphone. This work emphasizes the hardware implementation of the sensor and the detection/counting module, giving all necessary implementation details needed to construct it. PMID:25429412
Dense real-time stereo matching using memory efficient semi-global-matching variant based on FPGAs
NASA Astrophysics Data System (ADS)
Buder, Maximilian
2012-06-01
This paper presents a stereo image matching system that takes advantage of a global image matching method. The system is designed to provide depth information for mobile robotic applications. Typical tasks of the proposed system are to assist in obstacle avoidance, SLAM and path planning. Mobile robots pose strong requirements on size, energy consumption, reliability and output quality of the image matching subsystem. Currently available systems either rely on active sensors or on local stereo image matching algorithms. The former are only suitable in controlled environments, while the latter suffer from low-quality depth maps. Top-ranking quality results are only achieved by an iterative approach using global image matching and color segmentation techniques, which are computationally demanding and therefore difficult to execute in real time. Attempts have been made to reach real-time performance with global methods by simplifying the routines, but the resulting depth maps are then almost comparable to local methods. An identically named semi-global algorithm was proposed earlier that shows both very good image matching results and relatively simple operations. A memory-efficient variant of the Semi-Global Matching algorithm is reviewed and adopted for an implementation based on reconfigurable hardware. The implementation is suitable for real-time execution in the field of robotics. It is shown that the modified version of the efficient Semi-Global Matching method delivers results equivalent to the original algorithm on the Middlebury dataset. The system has proven capable of processing VGA-sized images with a disparity resolution of 64 pixels at 33 frames per second on low-cost to mid-range hardware. If the focus is shifted to a higher image resolution, 1024×1024 stereo frames may be processed with the same hardware at 10 fps, with the disparity resolution settings unchanged. A mobile system that covers preprocessing, matching and interfacing operations is also presented.
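For orientation, the core of Semi-Global Matching is a per-path cost recursion. A Python sketch of one update step (penalties P1/P2 and the example values are arbitrary):

```python
import numpy as np

def sgm_path_update(prev, cost, p1, p2):
    """One SGM step along a path:
    L(p,d) = C(p,d) + min(L(p-1,d), L(p-1,d±1)+P1, min_k L(p-1,k)+P2) - min_k L(p-1,k).
    Subtracting the running minimum bounds the accumulated values, which is
    what makes fixed-width, memory-efficient hardware accumulators possible."""
    m = prev.min()
    shifted_m1 = np.roll(prev, 1);  shifted_m1[0] = np.inf   # neighbour d-1
    shifted_p1 = np.roll(prev, -1); shifted_p1[-1] = np.inf  # neighbour d+1
    best = np.minimum.reduce([prev, shifted_m1 + p1, shifted_p1 + p1,
                              np.full_like(prev, m + p2)])
    return cost + best - m

prev = np.array([10., 4., 7., 9.])
print(sgm_path_update(prev, cost=np.array([1., 2., 1., 3.]), p1=1., p2=3.))
# -> [2. 2. 2. 6.]
```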
Monitoring and detection platform to prevent anomalous situations in home care.
Villarrubia, Gabriel; Bajo, Javier; De Paz, Juan F; Corchado, Juan M
2014-06-05
Monitoring and tracking people at home usually requires high-cost hardware installations, which makes it unaffordable in many situations. This paper proposes a monitoring and tracking system for people with medical problems. A virtual organization of agents based on the PANGEA platform, which allows the easy integration of different devices, was created for this study. In this case, a virtual organization was implemented to track and monitor patients carrying a Holter monitor. The system includes the hardware and software required to perform ECG measurements and monitoring through accelerometers and WiFi networks. Furthermore, the use of interactive television can mediate interactivity with the user. The system makes it possible to merge the information and facilitates efficient patient tracking at low cost.
A Reasoning Hardware Platform for Real-Time Common-Sense Inference
Barba, Jesús; Santofimia, Maria J.; Dondo, Julio; Rincón, Fernando; Sánchez, Francisco; López, Juan Carlos
2012-01-01
Enabling Ambient Intelligence systems to understand the activities that are taking place in a supervised context is a rather complicated task. Moreover, this task cannot be successfully addressed while overlooking the mechanisms (common-sense knowledge and reasoning) that entitle us, as human beings, to successfully undertake it. This work is based on the premise that Ambient Intelligence systems will be able to understand and react to context events if common-sense capabilities are embodied in them. However, there are some difficulties that need to be resolved before common-sense capabilities can be fully deployed to Ambient Intelligence. This work presents a hardware-accelerated implementation of a common-sense knowledge-base system intended to improve response time and efficiency. PMID:23012540
FPGA-based real-time phase measuring profilometry algorithm design and implementation
NASA Astrophysics Data System (ADS)
Zhan, Guomin; Tang, Hongwei; Zhong, Kai; Li, Zhongwei; Shi, Yusheng
2016-11-01
Phase measuring profilometry (PMP) has been widely used in many fields, such as Computer Aided Verification (CAV) and Flexible Manufacturing Systems (FMS). High frame-rate (HFR) real-time vision-based feedback control will be a common demand in the near future. However, the instruction time delay in a computer caused by numerous repetitive operations greatly limits the efficiency of data processing. FPGAs have the advantages of a pipeline architecture and parallel execution, making them well suited to the PMP algorithm. In this paper, we design a fully pipelined hardware architecture for PMP. The architecture implements rectification, phase calculation, phase shifting, and stereo matching. Experiments verified the performance of this method, and the factors that may influence the computation accuracy were analyzed.
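The phase-calculation stage of a PMP pipeline typically reduces to a small closed-form expression. For the common four-step variant (phase shifts of 0, 90, 180, 270 degrees), the wrapped phase is:

```python
import numpy as np

def wrapped_phase(i1, i2, i3, i4):
    """Four-step phase-shifting formula: phi = atan2(I4 - I2, I1 - I3).
    atan2 maps naturally onto a CORDIC unit, the usual way to realize it
    inside an FPGA pipeline. Whether this paper uses exactly four steps is
    an assumption of this sketch."""
    return np.arctan2(i4 - i2, i1 - i3)
```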
NASA Technical Reports Server (NTRS)
Duong, T. A.
2004-01-01
In this paper, we present a new, simple sequential learning technique for adaptive Principal Component Analysis (PCA) with an optimized hardware architecture, which will help optimize the hardware implementation in VLSI and overcome the difficulties of traditional gradient descent in learning convergence and hardware implementation.
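For orientation, the textbook sequential PCA update (Oja's rule) looks like this in Python; it is shown as the classic baseline such techniques refine, not necessarily the author's exact rule.

```python
import numpy as np

def oja_update(w, x, lr=0.01):
    """One sequential PCA step (Oja's rule): w += lr * y * (x - y * w).
    The update is a handful of multiply-accumulates, which is what makes
    rules of this family attractive for compact VLSI learning circuits."""
    y = np.dot(w, x)
    return w + lr * y * (x - y * w)

rng = np.random.default_rng(0)
w = rng.normal(size=3); w /= np.linalg.norm(w)
for _ in range(2000):
    x = rng.normal(size=3) * np.array([3.0, 1.0, 0.3])  # dominant first axis
    w = oja_update(w, x)
# w now approximates the first principal component (close to ±[1, 0, 0])
```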
Scalability and Portability of Two Parallel Implementations of ADI
NASA Technical Reports Server (NTRS)
Phung, Thanh; VanderWijngaart, Rob F.
1994-01-01
Two domain decompositions for the implementation of the NAS Scalar Penta-diagonal Parallel Benchmark on MIMD systems are investigated, namely transposition and multi-partitioning. Hardware platforms considered are the Intel iPSC/860 and Paragon XP/S-15, and clusters of SGI workstations on ethernet, communicating through PVM. It is found that the multi-partitioning strategy offers the kind of coarse granularity that allows scaling up to hundreds of processors on a massively parallel machine. Moreover, efficiency is retained when the code is ported verbatim (save message passing syntax) to a PVM environment on a modest size cluster of workstations.
The Ames Virtual Environment Workstation: Implementation issues and requirements
NASA Technical Reports Server (NTRS)
Fisher, Scott S.; Jacoby, R.; Bryson, S.; Stone, P.; Mcdowall, I.; Bolas, M.; Dasaro, D.; Wenzel, Elizabeth M.; Coler, C.; Kerr, D.
1991-01-01
This presentation describes recent developments in the implementation of a virtual environment workstation in the Aerospace Human Factors Research Division of NASA's Ames Research Center. Introductory discussions are presented on the primary research objectives and applications of the system and on the system's current hardware and software configuration. Principal attention is then focused on unique issues and problems encountered in the workstation's development, with emphasis on its ability to meet original design specifications for computational graphics performance and for the associated human factors requirements necessary to provide a compelling sense of presence and efficient interaction in the virtual environment.
Optimizations of a Hardware Decoder for Deep-Space Optical Communications
NASA Technical Reports Server (NTRS)
Cheng, Michael K.; Nakashima, Michael A.; Moision, Bruce E.; Hamkins, Jon
2007-01-01
The National Aeronautics and Space Administration has developed a capacity approaching modulation and coding scheme that comprises a serial concatenation of an inner accumulate pulse-position modulation (PPM) and an outer convolutional code [or serially concatenated PPM (SCPPM)] for deep-space optical communications. Decoding of this code uses the turbo principle. However, due to the nonbinary property of SCPPM, a straightforward application of classical turbo decoding is very inefficient. Here, we present various optimizations applicable in hardware implementation of the SCPPM decoder. More specifically, we feature a Super Gamma computation to efficiently handle parallel trellis edges, a pipeline-friendly 'maxstar top-2' circuit that reduces the max-only approximation penalty, a low-latency cyclic redundancy check circuit for window-based decoders, and a high-speed algorithmic polynomial interleaver that leads to memory savings. Using the featured optimizations, we implement a 6.72 megabits-per-second (Mbps) SCPPM decoder on a single field-programmable gate array (FPGA). Compared to the current data rate of 256 kilobits per second from Mars, the SCPPM coded scheme represents a throughput increase of more than twenty-six fold. Extension to a 50-Mbps decoder on a board with multiple FPGAs follows naturally. We show through hardware simulations that the SCPPM coded system can operate within 1 dB of the Shannon capacity at nominal operating conditions.
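The "maxstar top-2" optimization mentioned above centers on the Jacobian logarithm. A scalar Python reference follows; in hardware the log term is replaced by a small lookup/correction, and tracking the top two candidates is what lets that correction be applied cheaply in a pipeline.

```python
import math

def maxstar(a, b):
    """Jacobian logarithm: max*(a, b) = max(a, b) + log(1 + e^-|a-b|).
    The max-only approximation drops the second term; the correction
    shrinks that penalty at modest hardware cost."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

assert abs(maxstar(0.0, 0.0) - math.log(2.0)) < 1e-12
```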
Compact Modbus TCP/IP protocol for data acquisition systems based on limited hardware resources
NASA Astrophysics Data System (ADS)
Bai, Q.; Jin, B.; Wang, D.; Wang, Y.; Liu, X.
2018-04-01
The Modbus TCP/IP has been a standard industry communication protocol and is widely utilized for establishing sensor-cloud platforms on the Internet. However, numerous existing data acquisition systems built on traditional single-chip microcontrollers without sufficient resources cannot support it, because the complete Modbus TCP/IP protocol depends on a full operating system, which occupies abundant hardware resources. Hence, a compact Modbus TCP/IP protocol is proposed in this work to make it run efficiently and stably even on a resource-limited hardware platform. Firstly, the Modbus TCP/IP protocol stack is analyzed and a refined protocol suite is rebuilt by streamlining the typical TCP/IP suite. Then, the specific implementation of every hierarchical layer is presented in detail according to the protocol structure. Besides, the compact protocol is implemented on a traditional microprocessor to validate the feasibility of the scheme. Finally, the performance of the proposed scenario is assessed. The experimental results demonstrate that message packets match the frame format of the Modbus TCP/IP protocol and the average bandwidth reaches 1.15 Mbps. The compact protocol operates stably even on a traditional microcontroller with only 4-kB RAM and a 12-MHz system clock, and no communication congestion or frequent packet loss occurs.
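For reference, the on-wire framing a compact stack must reproduce is small: a 7-byte MBAP header (transaction id, protocol id 0, remaining length, unit id) followed by the PDU. A Python sketch building a standard Read Holding Registers request, using the well-known Modbus example of unit 0x11 reading 3 registers starting at 0x006B:

```python
import struct

def read_holding_registers(tid, unit, addr, count):
    """Build a Modbus TCP request: MBAP header + PDU.
    The length field counts the unit id plus the PDU bytes."""
    pdu = struct.pack(">BHH", 0x03, addr, count)          # function 0x03
    mbap = struct.pack(">HHHB", tid, 0x0000, len(pdu) + 1, unit)
    return mbap + pdu

frame = read_holding_registers(tid=1, unit=0x11, addr=0x006B, count=3)
assert frame.hex() == "0001000000061103006b0003"
```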
NASA Astrophysics Data System (ADS)
Ba, Seydou N.; Waheed, Khurram; Zhou, G. Tong
2010-12-01
Digital predistortion is an effective means to compensate for the nonlinear effects of a memoryless system. In the case of a cellular transmitter, a digital baseband predistorter can mitigate the undesirable nonlinear effects along the signal chain, particularly the nonlinear impairments in the radiofrequency (RF) amplifiers. To be practically feasible, the implementation complexity of the predistorter must be minimized so that it becomes a cost-effective solution for the resource-limited wireless handset. This paper proposes optimizations that facilitate the design of a low-cost, high-performance adaptive digital baseband predistorter for memoryless systems. A comparative performance analysis of the amplitude and power lookup table (LUT) indexing schemes is presented. An optimized low-complexity amplitude approximation and its hardware synthesis results are also studied. An efficient LUT predistorter training algorithm that combines the fast convergence speed of normalized least mean squares (NLMS) with a small hardware footprint is proposed. Results of fixed-point simulations based on the measured nonlinear characteristics of an RF amplifier are presented.
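A loose Python sketch of the general scheme (an amplitude-indexed complex-gain LUT adapted with a normalized LMS step); the table size, step size, and class layout are illustrative assumptions, not the paper's fixed-point design.

```python
import numpy as np

class LutPredistorter:
    """Amplitude-indexed complex-gain LUT with an NLMS-style update."""
    def __init__(self, size=128, mu=0.5):
        self.gains = np.ones(size, dtype=complex)
        self.size, self.mu = size, mu

    def index(self, x):
        # amplitude indexing; the paper also analyzes power (|x|^2) indexing
        return min(int(abs(x) * self.size), self.size - 1)  # assumes |x| < 1

    def predistort(self, x):
        return x * self.gains[self.index(x)]

    def adapt(self, x, y_pa, target):
        """Normalizing by input power keeps the convergence speed roughly
        independent of the signal amplitude hitting each LUT bin."""
        i = self.index(x)
        err = target - y_pa
        self.gains[i] += self.mu * err * np.conj(x) / (abs(x) ** 2 + 1e-9)
```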
A Streaming PCA VLSI Chip for Neural Data Compression.
Wu, Tong; Zhao, Wenfeng; Guo, Hongsun; Lim, Hubert H; Yang, Zhi
2017-12-01
Neural recording system miniaturization and integration with low-power wireless technologies require compressing neural data before transmission. Feature extraction is a procedure to represent data in a low-dimensional space; its integration into a recording chip can be an efficient approach to compress neural data. In this paper, we propose a streaming principal component analysis algorithm and its microchip implementation to compress multichannel local field potential (LFP) and spike data. The circuits have been designed in a 65-nm CMOS technology and occupy a silicon area of 0.06 mm². Throughout the experiments, the chip compresses LFPs by a factor of 10 at the expense of reconstruction errors as low as 1% and 144-nW/channel power consumption; for spikes, the achieved compression ratio is 25 with 8% reconstruction errors and 3.05-μW/channel power consumption. In addition, the algorithm and its hardware architecture can swiftly adapt to nonstationary spiking activities, which enables efficient hardware sharing among multiple channels to support a high-channel-count recorder.
High-throughput sample adaptive offset hardware architecture for high-efficiency video coding
NASA Astrophysics Data System (ADS)
Zhou, Wei; Yan, Chang; Zhang, Jingzhi; Zhou, Xin
2018-03-01
A high-throughput hardware architecture for a sample adaptive offset (SAO) filter in the High Efficiency Video Coding (HEVC) standard is presented. First, an implementation-friendly and simplified bitrate estimation method for rate-distortion cost calculation is proposed to reduce the computational complexity of the SAO mode decision. Then, a high-throughput VLSI architecture for SAO is presented based on the proposed bitrate estimation method. Furthermore, a multiparallel VLSI architecture for in-loop filters, which integrates both the deblocking filter and the SAO filter, is proposed. Six parallel strategies are applied in the proposed in-loop filter architecture to improve the system throughput and filtering speed. Experimental results show that the proposed in-loop filter architecture can achieve up to 48% higher throughput in comparison with prior work. The proposed architecture can reach a high operating clock frequency of 297 MHz with a TSMC 65-nm library and meets the real-time requirement of the in-loop filters for the 8K × 4K video format at 132 fps.
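For context, the per-pixel kernel that SAO edge-offset hardware classifies is tiny; the standard category rule along one direction is shown below in Python (the decoder then adds the offset signalled for the category).

```python
import numpy as np

def eo_category(left, cur, right):
    """HEVC SAO edge-offset class along one direction:
    1 = local minimum, 2 = concave edge, 3 = convex edge,
    4 = local maximum, 0 = none."""
    s = np.sign(cur - left) + np.sign(cur - right)
    return {-2: 1, -1: 2, 1: 3, 2: 4}.get(int(s), 0)

assert eo_category(10, 7, 10) == 1   # valley pixel
assert eo_category(10, 12, 10) == 4  # peak pixel
```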
Energy conservation and management system using efficient building automation
NASA Astrophysics Data System (ADS)
Ahmed, S. Faiz; Hazry, D.; Tanveer, M. Hassan; Joyo, M. Kamran; Warsi, Faizan A.; Kamarudin, H.; Wan, Khairunizam; Razlan, Zuradzman M.; Shahriman A., B.; Hussain, A. T.
2015-05-01
In countries where the demand and supply gap of electricity is huge and people are forced to endure increasing hours of load shedding, unnecessary consumption of electricity makes matters even worse, so the importance of and need for electricity conservation increase exponentially. This paper outlines a step towards the conservation of energy in general and electricity in particular by employing an efficient building automation technique. It should be noted that by careful design and implementation of the building automation system, energy consumption can be reduced by up to 30% to 40%, which makes a huge difference for energy saving. In this study the above-mentioned concept is verified by performing experiments on a prototype experimental room and by implementing an efficient building automation technique. For this automation, a Programmable Logic Controller (PLC) is employed as the main controller, monitoring various system parameters and controlling appliances as required. The hardware test run and experimental findings further clarify and prove the concept. An added advantage of this project is that it can be implemented in both small and medium-sized domestic homes, thus greatly reducing the overall unnecessary load on the utility provider.
A hardware-oriented concurrent TZ search algorithm for High-Efficiency Video Coding
NASA Astrophysics Data System (ADS)
Doan, Nghia; Kim, Tae Sung; Rhee, Chae Eun; Lee, Hyuk-Jae
2017-12-01
High-Efficiency Video Coding (HEVC) is the latest video coding standard, in which the compression performance is double that of its predecessor, the H.264/AVC standard, while the video quality remains unchanged. In HEVC, the test zone (TZ) search algorithm is widely used for integer motion estimation because it effectively finds a good-quality motion vector with a relatively small amount of computation. However, the complex computation structure of the TZ search algorithm makes it difficult to implement in hardware. This paper proposes a new integer motion estimation algorithm which is designed for hardware execution by modifying the conventional TZ search to allow parallel motion estimation of all prediction unit (PU) partitions. The algorithm consists of the three phases of zonal, raster, and refinement searches. At the beginning of each phase, the algorithm obtains the search points required by the original TZ search for all PU partitions in a coding unit (CU). Then, all redundant search points are removed prior to the estimation of the motion costs, and the best search points are selected for all PUs. Compared to the conventional TZ search algorithm, experimental results show that the proposed algorithm significantly decreases the Bjøntegaard Delta bitrate (BD-BR) by 0.84%, and it also reduces the computational complexity by 54.54%.
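The redundant-point removal step amounts to a set union over the per-PU requests; a trivial Python sketch (data layout is illustrative):

```python
def gather_unique_points(pu_requests):
    """Merge the search points requested by all PU partitions in a CU and
    drop duplicates, so each candidate motion vector is costed only once."""
    seen, merged = set(), []
    for points in pu_requests:      # one list of (x, y) points per PU
        for p in points:
            if p not in seen:
                seen.add(p)
                merged.append(p)
    return merged

pus = [[(0, 0), (2, 0)], [(0, 0), (0, 2)]]
print(gather_unique_points(pus))    # [(0, 0), (2, 0), (0, 2)]
```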
Computing Generalized Matrix Inverse on Spiking Neural Substrate
Shukla, Rohit; Khoram, Soroosh; Jorgensen, Erik; Li, Jing; Lipasti, Mikko; Wright, Stephen
2018-01-01
Emerging neural hardware substrates, such as IBM's TrueNorth Neurosynaptic System, can provide an appealing platform for deploying numerical algorithms. For example, a recurrent Hopfield neural network can be used to find the Moore-Penrose generalized inverse of a matrix, thus enabling a broad class of linear optimizations to be solved efficiently, at low energy cost. However, deploying numerical algorithms on hardware platforms that severely limit the range and precision of representation for numeric quantities can be quite challenging. This paper discusses these challenges and proposes a rigorous mathematical framework for reasoning about range and precision on such substrates. The paper derives techniques for normalizing inputs and properly quantizing synaptic weights originating from arbitrary systems of linear equations, so that solvers for those systems can be implemented in a provably correct manner on hardware-constrained neural substrates. The analytical model is empirically validated on the IBM TrueNorth platform, and results show that the guarantees provided by the framework for range and precision hold under experimental conditions. Experiments with optical flow demonstrate the energy benefits of deploying a reduced-precision and energy-efficient generalized matrix inverse engine on the IBM TrueNorth platform, reflecting 10× to 100× improvement over FPGA and ARM core baselines. PMID:29593483
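As a conventional numerical counterpart of the Hopfield-network solver (shown for orientation, not as the paper's method), the Ben-Israel/Newton-Schulz iteration reaches the Moore-Penrose inverse using only matrix multiply-accumulates, the kind of operation such substrates provide:

```python
import numpy as np

def pseudoinverse_iterative(a, iters=50):
    """Ben-Israel/Newton-Schulz iteration X <- X(2I - AX), converging to the
    Moore-Penrose inverse from X0 = alpha * A^T with alpha < 2/sigma_max^2."""
    x = a.T / (np.linalg.norm(a, 2) ** 2)   # alpha = 1/sigma_max^2
    eye2 = 2.0 * np.eye(a.shape[0])
    for _ in range(iters):
        x = x @ (eye2 - a @ x)
    return x

a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
assert np.allclose(pseudoinverse_iterative(a), np.linalg.pinv(a), atol=1e-6)
```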
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grant, Ryan E.; Barrett, Brian W.; Pedretti, Kevin
The Portals reference implementation is based on the Portals 4.X API, published by Sandia National Laboratories as a freely available public document. It is designed to be an implementation of the Portals networking application programming interface and is used by several other upper-layer protocols like SHMEM, GASNet and MPI. It is implemented over existing networks, specifically Ethernet and InfiniBand networks. This implementation provides Portals networking functionality and serves as a software emulation of Portals-compliant networking hardware. It can be used to develop software using the Portals API prior to the debut of Portals networking hardware, such as Bull's BXI interconnect, as well as a substitute for Portals hardware on development platforms that do not have Portals-compliant hardware. The reference implementation provides new capabilities beyond those of a typical network, namely the ability to have messages matched in hardware in a way compatible with upper-layer software such as MPI or SHMEM. It also offers methods of offloading network operations via triggered operations, which can be used to create offloaded collective operations. Specific details on the Portals API can be found at http://portals4.org.
Mostafa, Hesham; Pedroni, Bruno; Sheik, Sadique; Cauwenberghs, Gert
2017-01-01
Artificial neural networks (ANNs) trained using backpropagation are powerful learning architectures that have achieved state-of-the-art performance in various benchmarks. Significant effort has been devoted to developing custom silicon devices to accelerate inference in ANNs. Accelerating the training phase, however, has attracted relatively little attention. In this paper, we describe a hardware-efficient on-line learning technique for feedforward multi-layer ANNs that is based on pipelined backpropagation. Learning is performed in parallel with inference in the forward pass, removing the need for an explicit backward pass and requiring no extra weight lookup. By using binary state variables in the feedforward network and ternary errors in truncated-error backpropagation, the need for any multiplications in the forward and backward passes is removed, and memory requirements for the pipelining are drastically reduced. Further reduction in addition operations owing to the sparsity in the forward neural and backpropagating error signal paths contributes to highly efficient hardware implementation. For proof-of-concept validation, we demonstrate on-line learning of MNIST handwritten digit classification on a Spartan 6 FPGA interfacing with an external 1 Gb DDR2 DRAM, which shows small degradation in test error performance compared to an equivalently sized binary ANN trained off-line using standard backpropagation and exact errors. Our results highlight an attractive synergy between pipelined backpropagation and binary-state networks in substantially reducing computation and memory requirements, making pipelined on-line learning practical in deep networks. PMID:28932180
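A toy sketch of why binary activations and ternary errors remove multiplications: the weight-update outer product reduces to signed additions. The threshold and learning rate are arbitrary assumptions, and this ignores the pipelining that is the paper's main contribution.

```python
import numpy as np

def ternarize(err, thresh=0.5):
    """Truncate a real-valued error vector to {-1, 0, +1}."""
    return (np.sign(err) * (np.abs(err) > thresh)).astype(np.int8)

def update(W, x_bin, err_ter, lr=0.1):
    """Outer-product weight update with binary x and ternary error:
    every term is 0 or +/-lr, so no multiplier is needed in hardware."""
    return W - lr * np.outer(err_ter, x_bin)

rng = np.random.default_rng(0)
x = (rng.random(8) > 0.5).astype(np.int8)   # binary activations
e = ternarize(rng.standard_normal(4))       # ternary errors
print(update(np.zeros((4, 8)), x, e))
```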
On-Chip Neural Data Compression Based On Compressed Sensing With Sparse Sensing Matrices.
Zhao, Wenfeng; Sun, Biao; Wu, Tong; Yang, Zhi
2018-02-01
On-chip neural data compression is an enabling technique for wireless neural interfaces that suffer from insufficient bandwidth and power budgets to transmit the raw data. The data compression algorithm and its implementation should be power and area efficient and functionally reliable over different datasets. Compressed sensing is an emerging technique that has been applied to compress various neurophysiological data. However, the state-of-the-art compressed sensing (CS) encoders leverage random but dense binary measurement matrices, which incur substantial implementation costs on both power and area that could offset the benefits from the reduced wireless data rate. In this paper, we propose two CS encoder designs based on sparse measurement matrices that could lead to efficient hardware implementation. Specifically, two different approaches for the construction of sparse measurement matrices, i.e., the deterministic quasi-cyclic array code (QCAC) matrix and the sparse random binary matrix (SRBM), are exploited. We demonstrate that the proposed CS encoders lead to comparable recovery performance, and efficient VLSI architecture designs are proposed for the QCAC-CS and SRBM encoders with reduced area and total power consumption.
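The hardware saving comes from sparsity: with d ones per column, each input sample touches only d accumulators. A small sketch with a random d-sparse binary matrix (an illustrative construction, not the paper's QCAC code):

```python
import numpy as np

def sparse_binary_matrix(m, n, d, rng):
    """m x n measurement matrix with exactly d ones per column."""
    Phi = np.zeros((m, n), dtype=np.int8)
    for j in range(n):
        Phi[rng.choice(m, size=d, replace=False), j] = 1
    return Phi

rng = np.random.default_rng(0)
Phi = sparse_binary_matrix(m=32, n=128, d=3, rng=rng)
x = rng.standard_normal(128)     # one window of neural samples
y = Phi @ x                      # encoding: d additions per sample
print(y.shape)                   # (32,): 4x fewer values to transmit
```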
Fang, Wai-Chi; Huang, Kuan-Ju; Chou, Chia-Ching; Chang, Jui-Chung; Cauwenberghs, Gert; Jung, Tzyy-Ping
2014-01-01
This paper proposes an efficient very-large-scale integration (VLSI) design: a 16-channel on-line recursive independent component analysis (ORICA) processor ASIC for real-time EEG systems, implemented in TSMC 40 nm CMOS technology. ORICA is appropriate for use in real-time EEG systems to separate artifacts because of its highly efficient, real-time processing. The proposed ORICA processor is composed of an ORICA processing unit and a singular value decomposition (SVD) processing unit. Compared with previous work [1], this ORICA processor has enhanced effectiveness and reduced hardware complexity, achieved by utilizing a deeper pipeline architecture, a shared arithmetic processing unit, and shared registers. A 16-channel random signal set containing 8-channel super-Gaussian and 8-channel sub-Gaussian components is used to analyze the dependence of the source components, and the average correlation coefficient between the original source signals and the extracted ORICA signals is 0.95452. Finally, the proposed ORICA processor ASIC is implemented in TSMC 40 nm CMOS technology and consumes 15.72 mW at a 100 MHz operating frequency.
Using a hybrid neuron in physiologically inspired models of the basal ganglia.
Thibeault, Corey M; Srinivasa, Narayan
2013-01-01
Our current understanding of the basal ganglia (BG) has facilitated the creation of computational models that have contributed novel theories, explored new functional anatomy and demonstrated results complementing physiological experiments. However, the utility of these models extends beyond these applications, particularly in neuromorphic engineering, where the basal ganglia's role in computation is important for applications such as power-efficient autonomous agents and model-based control strategies. The neurons used in existing computational models of the BG, however, are not amenable to many low-power hardware implementations. Motivated by a need for more hardware-accessible networks, we replicate four published models of the BG, spanning single neurons and small networks, replacing the more computationally expensive neuron models with an Izhikevich hybrid neuron. This begins with a network modeling action-selection, where the basal activity levels and the ability to appropriately select the most salient input are reproduced. A Parkinson's disease model is then explored under normal conditions, Parkinsonian conditions and during subthalamic nucleus deep brain stimulation (DBS). The resulting network is capable of replicating the loss of thalamic relay capabilities in the Parkinsonian state and its return under DBS. This is also demonstrated using a network capable of action-selection. Finally, a study of correlation transfer under different patterns of Parkinsonian activity is presented. These networks successfully captured the significant results of the original studies. This not only creates a foundation for neuromorphic hardware implementations but may also support the development of large-scale biophysical models. The former potentially provides a way of improving the efficacy of DBS, and the latter allows for the efficient simulation of larger, more comprehensive networks.
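The Izhikevich hybrid neuron itself is compact enough to state in a few lines; the sketch below uses the published regular-spiking parameters, with an arbitrary constant input current for the demonstration.

```python
def izhikevich(I, T=1000.0, a=0.02, b=0.2, c=-65.0, d=8.0, dt=0.5):
    """Euler integration of the Izhikevich hybrid neuron:
    v' = 0.04v^2 + 5v + 140 - u + I,  u' = a(bv - u),
    with hybrid reset v<-c, u<-u+d when v crosses 30 mV.
    Returns spike times in ms (regular-spiking parameters)."""
    v, u, spikes = c, b * c, []
    for step in range(int(T / dt)):
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:                     # spike: apply the reset map
            spikes.append(step * dt)
            v, u = c, u + d
    return spikes

print(len(izhikevich(I=10.0)))            # tonic firing under constant drive
```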
Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications.
Cabezas, Javier; Gelado, Isaac; Stone, John E; Navarro, Nacho; Kirk, David B; Hwu, Wen-Mei
2015-05-01
Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems as well as adopted into the NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that supports key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on real runtime and hardware. Experimental results show that in an exemplar heterogeneous system, peer DMA and double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.
Study of a unified hardware and software fault-tolerant architecture
NASA Technical Reports Server (NTRS)
Lala, Jaynarayan; Alger, Linda; Friend, Steven; Greeley, Gregory; Sacco, Stephen; Adams, Stuart
1989-01-01
A unified architectural concept, called the Fault Tolerant Processor Attached Processor (FTP-AP), that can tolerate hardware as well as software faults is proposed for applications requiring ultrareliable computation capability. An emulation of the FTP-AP architecture, consisting of a breadboard Motorola 68010-based quadruply redundant Fault Tolerant Processor, four VAX 750s as attached processors, and four versions of a transport aircraft yaw damper control law, is used as a testbed in the AIRLAB to examine a number of critical issues. Solutions of several basic problems associated with N-Version software are proposed and implemented on the testbed. This includes a confidence voter to resolve coincident errors in N-Version software. A reliability model of N-Version software that is based upon the recent understanding of software failure mechanisms is also developed. The basic FTP-AP architectural concept appears suitable for hosting N-Version application software while at the same time tolerating hardware failures. Architectural enhancements for greater efficiency, software reliability modeling, and N-Version issues that merit further research are identified.
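As a simple illustration of the voting problem that N-Version software poses, here is a plain majority voter with a numeric agreement tolerance. It is a sketch only; the paper's confidence voter goes further, specifically to resolve coincident errors that a plain majority cannot.

```python
def majority_vote(outputs, tol=1e-6):
    """Cluster N versions' outputs (values within tol agree) and return
    the representative of any absolute-majority cluster."""
    clusters = []                        # [representative, count] pairs
    for y in outputs:
        for c in clusters:
            if abs(y - c[0]) <= tol:
                c[1] += 1
                break
        else:
            clusters.append([y, 1])
    rep, count = max(clusters, key=lambda c: c[1])
    if count > len(outputs) // 2:
        return rep
    raise RuntimeError("no majority: dispersed or coincident errors")

print(majority_vote([1.00, 1.00, 0.99, 7.30], tol=0.05))   # -> 1.0
```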
Small Private Key MQPKS on an Embedded Microprocessor
Seo, Hwajeong; Kim, Jihyun; Choi, Jongseok; Park, Taehwan; Liu, Zhe; Kim, Howon
2014-01-01
Multivariate quadratic (MQ) cryptography requires the use of long public and private keys to ensure a sufficient security level, but this is not favorable to embedded systems, which have limited system resources. Recently, various approaches to MQ cryptography using reduced public keys have been studied. As a result of this, at CHES2011 (Cryptographic Hardware and Embedded Systems, 2011), a small public key MQ scheme was proposed, and its feasible implementation on an embedded microprocessor was reported at CHES2012. However, the implementation of a small private key MQ scheme was not reported. For efficient implementation, random number generators can contribute to reducing the key size, but the cost of using a random number generator is much higher than that of computing MQ on modern microprocessors. Therefore, no feasible results have been reported on embedded microprocessors. In this paper, we propose a feasible implementation on embedded microprocessors for a small private key MQ scheme using a pseudo-random number generator and hash function based on a block cipher exploiting a hardware Advanced Encryption Standard (AES) accelerator. To speed up the performance, we apply various implementation methods, including parallel computation, on-the-fly computation, optimized logarithm representation, vinegar monomials and assembly programming. The proposed method reduces the private key size by about 99.9% and boosts signature generation and verification by 5.78% and 12.19%, respectively, compared with previous results in CHES2012. PMID:24651722
A novel pipeline based FPGA implementation of a genetic algorithm
NASA Astrophysics Data System (ADS)
Thirer, Nonel
2014-05-01
To solve problems for which an analytical solution is not available, more and more bio-inspired computation techniques have been applied in recent years. One efficient algorithm is the Genetic Algorithm (GA), which imitates the process of biological evolution, finding the solution through the mechanism of "natural selection", in which the fittest have the highest chances to survive. A genetic algorithm is an iterative procedure which operates on a population of individuals called "chromosomes" or "possible solutions" (usually represented by a binary code). The GA performs several processes on the population's individuals to produce a new population, as in biological evolution. To provide a high-speed solution, pipelined FPGA hardware implementations are used, with an n-stage pipeline for an n-phase genetic algorithm. FPGA pipeline implementations are constrained by the different execution times of each stage and by the FPGA chip resources. To minimize these difficulties, we propose a bio-inspired technique that modifies the crossover step by using non-identical twins: two of the chosen chromosomes (parents) build up two new chromosomes (children), not only one as in the classical GA. We analyze the contribution of this method to reducing the execution time in asynchronous and synchronous pipelines, as well as the possibility of a cheaper FPGA implementation by using smaller populations. The full hardware architecture of an FPGA implementation for our target Altera development board is presented and analyzed.
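A sketch of the modified crossover described above (the genetic operator only, not the paper's hardware pipeline): one crossover point, two complementary children.

```python
import random

def crossover_twins(parent_a, parent_b):
    """Single-point crossover returning both complementary children
    ('non-identical twins') instead of only one, as in the classical
    variant the paper modifies."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

random.seed(1)
a, b = [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]
print(crossover_twins(a, b))   # two children from one crossover point
```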
Small private key MQPKS on an embedded microprocessor.
Seo, Hwajeong; Kim, Jihyun; Choi, Jongseok; Park, Taehwan; Liu, Zhe; Kim, Howon
2014-03-19
Multivariate quadratic (MQ) cryptography requires the use of long public and private keys to ensure a sufficient security level, but this is not favorable to embedded systems, which have limited system resources. Recently, various approaches to MQ cryptography using reduced public keys have been studied. As a result of this, at CHES2011 (Cryptographic Hardware and Embedded Systems, 2011), a small public key MQ scheme was proposed, and its feasible implementation on an embedded microprocessor was reported at CHES2012. However, the implementation of a small private key MQ scheme was not reported. For efficient implementation, random number generators can contribute to reducing the key size, but the cost of using a random number generator is much higher than that of computing MQ on modern microprocessors. Therefore, no feasible results have been reported on embedded microprocessors. In this paper, we propose a feasible implementation on embedded microprocessors for a small private key MQ scheme using a pseudo-random number generator and hash function based on a block cipher exploiting a hardware Advanced Encryption Standard (AES) accelerator. To speed up the performance, we apply various implementation methods, including parallel computation, on-the-fly computation, optimized logarithm representation, vinegar monomials and assembly programming. The proposed method reduces the private key size by about 99.9% and boosts signature generation and verification by 5.78% and 12.19%, respectively, compared with previous results in CHES2012.
Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2
Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.
2014-01-01
Image segmentation is a very important step in the computerized analysis of digital images. The maxflow-mincut approach has been successfully used to obtain minimum-energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64-bit word. It is thus well-suited to the parallelization of graph-theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 32000² pixels in size, which are well beyond the largest previously reported in the literature. PMID:25598745
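For reference, the serial skeleton of the preflow-push (push-relabel) algorithm on a dense capacity matrix; this minimal version omits the FIFO/gap heuristics and, of course, the XMT-2's thread-level parallelism that the paper exploits.

```python
def preflow_push(cap, s, t):
    """Goldberg-Tarjan preflow-push maxflow (plain serial version)."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    height, excess = [0] * n, [0] * n
    height[s] = n
    for v in range(n):                        # saturate all source edges
        flow[s][v], flow[v][s] = cap[s][v], -cap[s][v]
        excess[v] = cap[s][v]
    active = [v for v in range(n) if v not in (s, t) and excess[v] > 0]
    while active:
        u = active[0]
        pushed = False
        for v in range(n):
            if excess[u] == 0:
                break
            if cap[u][v] - flow[u][v] > 0 and height[u] == height[v] + 1:
                d = min(excess[u], cap[u][v] - flow[u][v])    # push
                flow[u][v] += d
                flow[v][u] -= d
                excess[u] -= d
                excess[v] += d
                pushed = True
                if v not in (s, t) and v not in active:
                    active.append(v)
        if not pushed:                                        # relabel
            height[u] = 1 + min(height[v] for v in range(n)
                                if cap[u][v] - flow[u][v] > 0)
        if excess[u] == 0:
            active.pop(0)
    return sum(flow[s])

cap = [[0, 3, 2, 0], [0, 0, 1, 2], [0, 0, 0, 3], [0, 0, 0, 0]]
print(preflow_push(cap, s=0, t=3))            # max flow = 5
```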
NASA Technical Reports Server (NTRS)
Kizhner, Semion; Flatley, Thomas P.; Hestnes, Phyllis; Jentoft-Nilsen, Marit; Petrick, David J.; Day, John H. (Technical Monitor)
2001-01-01
Spacecraft telemetry rates have steadily increased over the last decade, presenting a problem for real-time processing by ground facilities. This paper proposes a solution to a related problem for the Geostationary Operational Environmental Spacecraft (GOES-8) image processing application. Although large super-computer facilities are the obvious heritage solution, they are very costly, making it imperative to seek a feasible alternative engineering solution at a fraction of the cost. The solution is based on a Personal Computer (PC) platform and a synergy of optimized software algorithms and re-configurable computing (RC) hardware technologies, such as Field Programmable Gate Arrays (FPGA) and Digital Signal Processing (DSP). It has been shown in [1] and [2] that this configuration can provide superior inexpensive performance for a chosen application on the ground station or on-board a spacecraft. However, since this technology is still maturing, intensive pre-hardware steps are necessary to achieve the benefits of hardware implementation. This paper describes these steps for the GOES-8 application, a software project developed using Interactive Data Language (IDL) (Trademark of Research Systems, Inc.) on a Workstation/UNIX platform. The solution involves converting the application to a PC/Windows/RC platform, selected mainly for the availability of low-cost, adaptable high-speed RC hardware. In order for the hybrid system to run, the IDL software was modified to account for platform differences. It was interesting to examine the gains and losses in performance on the new platform, as well as unexpected observations, before implementing hardware. After substantial pre-hardware optimization steps, the necessity of hardware implementation for bottleneck code in the PC environment became evident and solvable, beginning with the methodology described in [1] and [2] and implementing a novel methodology for this specific application [6]. The PC-RC interface bandwidth problem for the class of applications with moderate input-output data rates but large intermediate multi-thread data streams has been addressed and mitigated. This opens a new class of satellite image processing applications to bottleneck-problem solution using RC technologies. The issue of the level of abstraction of a science algorithm necessary for RC hardware implementation is also described. Selected Matlab functions already implemented in hardware were investigated for their direct applicability to the GOES-8 application, with the intent to create a library of Matlab and IDL RC functions for ongoing work. A complete class of spacecraft image processing applications using embedded re-configurable computing technology to meet real-time requirements, including performance results and comparison with the existing system, is described in this paper.
Photon Counting Using Edge-Detection Algorithm
NASA Technical Reports Server (NTRS)
Gin, Jonathan W.; Nguyen, Danh H.; Farr, William H.
2010-01-01
New applications such as high-data-rate, photon-starved, free-space optical communications require photon counting at flux rates into gigaphoton-per-second regimes coupled with subnanosecond timing accuracy. Current single-photon detectors that are capable of handling such operating conditions are designed in an array format and produce output pulses that span multiple sample times. In order to discern one pulse from another and not to overcount the number of incoming photons, a detection algorithm must be applied to the sampled detector output pulses. As flux rates increase, the ability to implement such a detection algorithm becomes difficult within a digital processor that may reside within a field-programmable gate array (FPGA). Systems have been developed and implemented to both characterize gigahertz bandwidth single-photon detectors, as well as process photon count signals at rates into gigaphotons per second in order to implement communications links at SCPPM (serial concatenated pulse position modulation) encoded data rates exceeding 100 megabits per second with efficiencies greater than two bits per detected photon. A hardware edge-detection algorithm and corresponding signal combining and deserialization hardware were developed to meet these requirements at sample rates up to 10 GHz. The photon discriminator deserializer hardware board accepts four inputs, which allows for the ability to take inputs from a quad-photon counting detector, to support requirements for optical tracking with a reduced number of hardware components. The four inputs are hardware leading-edge detected independently. After leading-edge detection, the resultant samples are ORed together prior to deserialization. The deserialization is performed to reduce the rate at which data is passed to a digital signal processor, perhaps residing within an FPGA. The hardware implements four separate analog inputs that are connected through RF connectors. Each analog input is fed to a high-speed 1-bit comparator, which digitizes the input referenced to an adjustable threshold value. This results in four independent serial sample streams of binary 1s and 0s, which are ORed together at rates up to 10 GHz. This single serial stream is then deserialized by a factor of 16 to create 16 signal lines at a rate of 622.5 MHz or lower for input to a high-speed digital processor assembly. The new design and corresponding hardware can be employed with a quad-photon counting detector capable of handling photon rates on the order of multi-gigaphotons per second, whereas prior art was only capable of handling a single input at 1/4 the flux rate. Additionally, the hardware edge-detection algorithm has provided the ability to process 3 to 10 times higher photon flux rates than previously possible, by removing the limitation that photon-counting detector output pulses on multiple channels being ORed must not overlap. Now, only the leading edges of the pulses are required to not overlap. This new photon counting digitizer hardware architecture supports a universal front end for an optical communications receiver operating at data rates from kilobits to over one gigabit per second to meet increased mission data volume requirements.
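The core leading-edge operation is a two-term Boolean function per sample. A bit-level Python sketch (the channel data are made up; the real design applies this to four streams in parallel before the OR and the 16:1 deserialization):

```python
def leading_edges(samples, prev=0):
    """Mark a 1 only where a sample is 1 and its predecessor was 0
    (edge = s AND NOT prev), so a pulse spanning many samples counts once."""
    edges = []
    for s in samples:
        edges.append(s & ~prev & 1)
        prev = s
    return edges

ch = [0, 1, 1, 1, 0, 1, 1, 0, 1]       # one comparator output stream
print(leading_edges(ch))               # [0,1,0,0,0,1,0,0,1] -> 3 photons

# Four edge-detected channels are then ORed sample-by-sample:
a, b = [0, 1, 0, 0], [0, 0, 1, 0]
print([x | y for x, y in zip(a, b)])   # [0, 1, 1, 0]
```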
Computerized atmospheric trace contaminant control simulation for manned spacecraft
NASA Technical Reports Server (NTRS)
Perry, J. L.
1993-01-01
Buildup of atmospheric trace contaminants in enclosed volumes such as a spacecraft may lead to potentially serious health problems for the crew members. For this reason, active control methods must be implemented to keep the concentration of atmospheric contaminants at levels that are considered safe for prolonged, continuous exposure. Designing hardware to accomplish this has traditionally required extensive testing to characterize and select appropriate control technologies. Data collected since the Apollo project can now be used in a computerized performance simulation to predict the performance and life of contamination control hardware, allowing for initial technology screening, performance prediction, and operations and contingency studies to determine the most suitable hardware approach before specific design and testing activities begin. The program, written in FORTRAN 77, provides the contaminant removal rate, total mass removed, and per-pass efficiency for each control device over discrete time intervals. In addition, the projected cabin concentration is provided. Input and output data are manipulated using commercial spreadsheet and data graphing software. These results can then be used in analyzing hardware design parameters such as sizing and flow rate, overall process performance and program economics. Test performance may also be predicted to aid test design.
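A toy version of the kind of bookkeeping such a simulation performs per time interval: a well-mixed, single-contaminant mass balance with one removal device. All numbers and the model itself are illustrative assumptions, far simpler than the actual FORTRAN program.

```python
def cabin_step(C, V, gen_rate, flow, eff, dt):
    """One interval of a well-mixed cabin mass balance.
    C: concentration (mg/m^3), V: cabin volume (m^3), gen_rate: mg/h,
    flow: device airflow (m^3/h), eff: per-pass removal efficiency."""
    removed = eff * flow * C * dt              # mg removed this interval
    C = max(C + (gen_rate * dt - removed) / V, 0.0)
    return C, removed

C, total_removed = 0.05, 0.0
for _ in range(24):                            # one day in 1 h steps
    C, r = cabin_step(C, V=100.0, gen_rate=2.0, flow=10.0, eff=0.9, dt=1.0)
    total_removed += r
print(round(C, 4), round(total_removed, 2))    # projected level, mass removed
```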
Sensor network based vehicle classification and license plate identification system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Frigo, Janette Rose; Brennan, Sean M; Rosten, Edward J
Typically, for energy efficiency and scalability purposes, sensor networks have been used in the context of environmental and traffic monitoring applications in which operations at the sensor level are not computationally intensive. But increasingly, sensor network applications require data- and compute-intensive sensors such as video cameras and microphones. In this paper, we describe the design and implementation of two such systems: a vehicle classifier based on acoustic signals and a license plate identification system using a camera. The systems are implemented in an energy-efficient manner to the extent possible using commercially available hardware, the Mica motes and the Stargate platform. Our experience in designing these systems leads us to consider an alternate, more flexible, modular, low-power mote architecture that uses a combination of FPGAs, specialized embedded processing units and sensor data acquisition systems.
NASA Astrophysics Data System (ADS)
Repka, Marek
2015-01-01
The original McEliece PKC proposal is interesting thanks to its resistance against all known attacks, even using quantum cryptanalysis, in an IND-CCA2 secure conversion. Here we present a generic implementation of the original McEliece PKC proposal, which provides test vectors (for all important intermediate results) and in which a measurement tool for side-channel analysis is employed. To the best of our knowledge, this is the first such implementation. This Calculator is valuable for implementation optimization, for further investigations of the properties of McEliece/Niederreiter-like PKCs, and also for teaching. Thanks to it, one can, for example, examine the side-channel vulnerability of a certain implementation, or find out and test particular parameters of the cryptosystem in order to make them appropriate for an efficient hardware implementation. This implementation is available [1] in executable binary format and as a static C++ library, as well as in the form of source code, for Linux and Windows operating systems.
Liu, Xing; Hou, Kun Mean; de Vaulx, Christophe; Shi, Hongling; Gholami, Khalid El
2014-01-01
Operating system (OS) technology is significant for the proliferation of the wireless sensor network (WSN). With an outstanding OS, the constrained WSN resources (processor, memory and energy) can be utilized efficiently, and user application development can be served soundly. In this article, a new hybrid, real-time, memory-efficient, energy-efficient, user-friendly and fault-tolerant WSN OS, MIROS, is designed and implemented. MIROS implements a hybrid scheduler and a dynamic memory allocator; real-time scheduling can thus be achieved with low memory consumption. In addition, it implements a mid-layer software, EMIDE (Efficient Mid-layer Software for User-Friendly Application Development Environment), to decouple the WSN application from the low-level system, so that the application programming process is simplified and the application reprogramming performance improved. Moreover, it combines both software and multi-core hardware techniques to conserve energy resources, improve node reliability, and achieve a new debugging method. To evaluate the performance of MIROS, it is compared with other WSN OSes (TinyOS, Contiki, SOS, openWSN and mantisOS) from different OS concerns. The final evaluation results prove that MIROS is suitable for use even on tightly resource-constrained WSN nodes. It can support real-time WSN applications. Furthermore, it is energy efficient, user friendly and fault tolerant. PMID:25248069
Monitoring and Detection Platform to Prevent Anomalous Situations in Home Care
Villarrubia, Gabriel; Bajo, Javier; De Paz, Juan F.; Corchado, Juan M.
2014-01-01
Monitoring and tracking people at home usually requires high-cost hardware installations, which implies they are not affordable in many situations. This study proposes a monitoring and tracking system for people with medical problems. A virtual organization of agents based on the PANGEA platform, which allows the easy integration of different devices, was created for this study. In this case, a virtual organization was implemented to track and monitor patients carrying a Holter monitor. The system includes the hardware and software required to perform ECG measurements and monitoring through accelerometers and WiFi networks. Furthermore, interactive television can be used to mediate interactivity with the user. The system makes it possible to merge the information and facilitates efficient, low-cost patient tracking. PMID:24905853
Real Time Phase Noise Meter Based on a Digital Signal Processor
NASA Technical Reports Server (NTRS)
Angrisani, Leopoldo; D'Arco, Mauro; Greenhall, Charles A.; Schiano Lo Morille, Rosario
2006-01-01
A digital signal-processing meter for phase noise measurement on sinusoidal signals is presented. It employs a special hardware architecture, made up of a core digital signal processor connected to a data acquisition board, and takes advantage of a quadrature demodulation-based measurement scheme already proposed by the authors. Thanks to an efficient measurement process and an optimized implementation of its fundamental stages, the proposed meter succeeds in exploiting all hardware resources in such an effective way as to attain high performance and real-time operation. For input frequencies up to some hundreds of kilohertz, the meter is capable of updating the phase noise power spectrum while seamlessly capturing the analyzed signal into its memory, and of granting frequency resolution as fine as a few hertz.
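The quadrature-demodulation idea reduces to a few operations: mix with a sin/cos pair at the nominal carrier, low-pass, and take the four-quadrant arctangent. A bare-bones Python sketch, where block averaging stands in for the meter's actual filtering and decimation stages:

```python
import numpy as np

def demodulate_phase(x, f0, fs, block=100):
    """Recover the phase fluctuation of a sinusoid near frequency f0."""
    n = np.arange(len(x))
    i = x * np.sin(2 * np.pi * f0 * n / fs)    # -> 0.5*cos(phi) after LP
    q = x * np.cos(2 * np.pi * f0 * n / fs)    # -> 0.5*sin(phi) after LP
    m = len(x) // block * block                # crude low-pass + decimate
    i = i[:m].reshape(-1, block).mean(axis=1)
    q = q[:m].reshape(-1, block).mean(axis=1)
    return np.unwrap(np.arctan2(q, i))

fs, f0 = 1_000_000, 10_000
n = np.arange(200_000)
phi = 0.01 * np.sin(2 * np.pi * 50 * n / fs)   # injected phase modulation
x = np.sin(2 * np.pi * f0 * n / fs + phi)
print(demodulate_phase(x, f0, fs).std())        # ~0.01/sqrt(2)
```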
Meshfree and efficient modeling of swimming cells
NASA Astrophysics Data System (ADS)
Gallagher, Meurig T.; Smith, David J.
2018-05-01
Locomotion in Stokes flow is an intensively studied problem because it describes important biological phenomena such as the motility of many species' sperm, bacteria, algae, and protozoa. Numerical computations can be challenging, particularly in three dimensions, due to the presence of moving boundaries and complex geometries; methods which combine ease of implementation and computational efficiency are therefore needed. A recently proposed method to discretize the regularized Stokeslet boundary integral equation without the need for a connected mesh is applied to the inertialess locomotion problem in Stokes flow. The mathematical formulation and key aspects of the computational implementation in matlab® or GNU Octave are described, followed by numerical experiments with biflagellate algae and multiple uniflagellate sperm swimming between no-slip surfaces, for which both swimming trajectories and flow fields are calculated. These computational experiments required minutes of time on modest hardware; an extensible implementation is provided in a GitHub repository. The nearest-neighbor discretization dramatically improves convergence and robustness, a key challenge in extending the regularized Stokeslet method to complicated three-dimensional biological fluid problems.
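For concreteness, a commonly used form of the regularized Stokeslet kernel (Cortez's standard blob choice, stated here as an assumption, since the paper may use a different blob) gives the velocity induced by N regularized point forces f(n) located at x_n:

```latex
% Velocity field of N regularized Stokeslets (a common blob choice);
% assumption: this standard kernel, not necessarily the paper's exact one.
u_i(\mathbf{x}) = \frac{1}{8\pi\mu}\sum_{n=1}^{N}
  \left[ \frac{r_n^2 + 2\varepsilon^2}{(r_n^2+\varepsilon^2)^{3/2}}\, f_i^{(n)}
  + \frac{\bigl(\mathbf{f}^{(n)}\!\cdot(\mathbf{x}-\mathbf{x}_n)\bigr)\,(x_i - x_{n,i})}
         {(r_n^2+\varepsilon^2)^{3/2}} \right],
\qquad r_n = \lVert \mathbf{x}-\mathbf{x}_n \rVert,
```

where epsilon is the regularization parameter; the nearest-neighbor discretization mentioned above concerns how the surface integral of this kernel is approximated without a connected mesh.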
A Low-Cost and Energy-Efficient Multiprocessor System-on-Chip for UWB MAC Layer
NASA Astrophysics Data System (ADS)
Xiao, Hao; Isshiki, Tsuyoshi; Khan, Arif Ullah; Li, Dongju; Kunieda, Hiroaki; Nakase, Yuko; Kimura, Sadahiro
Ultra-wideband (UWB) technology has attracted much attention recently due to its high data rate and low emission power. Its media access control (MAC) protocol, WiMedia MAC, provides many facilities for high-speed and high-quality wireless communication. However, these benefits come with a large computational load, which challenges traditional uniprocessor-based implementations to provide the required performance, while the constrained cost and power budget makes commercial multiprocessor solutions unrealistic. In this paper, a low-cost and energy-efficient multiprocessor system-on-chip (MPSoC), which tackles at once the aspects of system design, software migration and hardware architecture, is presented for the implementation of the UWB MAC layer. Experimental results show that the proposed MPSoC, based on four simple RISC processors and a shared-memory infrastructure, achieves up to 45% performance improvement and 65% power saving, while taking 15% less area than the uniprocessor implementation.
Hardware Implementation of Lossless Adaptive and Scalable Hyperspectral Data Compression for Space
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh; Keymeulen, Didier; Bakhshi, Alireza; Klimesh, Matthew
2009-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. The technique also improves signature extraction, object recognition and feature classification capabilities by providing exact reconstructed data on constrained downlink resources. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data was recently developed. This technique uses an adaptive filtering method and achieves a combination of low complexity and compression effectiveness that far exceeds state-of-the-art techniques currently in use. The JPL-developed 'Fast Lossless' algorithm requires no training data or other specific information about the nature of the spectral bands for a fixed instrument dynamic range. It is of low computational complexity and thus well-suited for implementation in hardware. A modified form of the algorithm that is better suited for data from pushbroom instruments is generally appropriate for flight implementation. A scalable field programmable gate array (FPGA) hardware implementation was developed. The FPGA implementation achieves a throughput performance of 58 Msamples/sec, which can be increased to over 100 Msamples/sec in a parallel implementation that uses twice the hardware resources. This paper describes the hardware implementation of the 'Modified Fast Lossless' compression algorithm on an FPGA. The FPGA implementation targets the current state-of-the-art FPGAs (Xilinx Virtex IV and V families) and compresses one sample every clock cycle to provide a fast and practical real-time solution for space applications.
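In the spirit of the adaptive-filtering approach (not the actual Fast Lossless algorithm, whose predictor and coder are more elaborate), one spectral band can be predicted from its neighbor with a sign-sign LMS weight update, and only the integer residuals passed to the entropy coder:

```python
import numpy as np

def predict_residuals(band, prev_band, mu=0.05):
    """Adaptive one-tap prediction of a band from the previous band;
    the prediction uses only already-seen data, so a decoder running
    the same update reconstructs the band losslessly from residuals."""
    w = 0.0
    res = np.empty_like(band)
    for k in range(len(band)):
        pred = int(round(w * prev_band[k]))
        res[k] = band[k] - pred
        w += mu * np.sign(res[k]) * np.sign(prev_band[k])   # sign-sign LMS
    return res

prev = np.arange(100, dtype=np.int64)
cur = 2 * prev + np.random.default_rng(1).integers(-2, 3, 100)
res = predict_residuals(cur, prev)
print(np.abs(res[:10]).mean(), np.abs(res[-10:]).mean())    # residuals shrink
```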
NASA Technical Reports Server (NTRS)
Romine, Peter L.
1991-01-01
This final report documents the development and installation of software and hardware for Robotic Welding Process Control. Primary emphasis is on serial communications between the CYRO 750 robotic welder, Heurikon minicomputer running Hunter & Ready VRTX, and an IBM PC/AT, for offline programming and control and closed-loop welding control. The requirements for completion of the implementation of the Rocketdyne weld tracking control are discussed. The procedure for downloading programs from the Intergraph, over the network, is discussed. Conclusions are made on the results of this task, and recommendations are made for efficient implementation of communications, weld process control development, and advanced process control procedures using the Heurikon.
Linear precoding based on polynomial expansion: reducing complexity in massive MIMO.
Mueller, Axel; Kammoun, Abla; Björnson, Emil; Debbah, Mérouane
Massive multiple-input multiple-output (MIMO) techniques have the potential to bring tremendous improvements in spectral efficiency to future communication systems. Counterintuitively, the practical issues of having uncertain channel knowledge, high propagation losses, and implementing optimal non-linear precoding are solved more or less automatically by enlarging system dimensions. However, the computational precoding complexity grows with the system dimensions. For example, the close-to-optimal and relatively "antenna-efficient" regularized zero-forcing (RZF) precoding is very complicated to implement in practice, since it requires fast inversions of large matrices in every coherence period. Motivated by the high performance of RZF, we propose to replace the matrix inversion and multiplication by a truncated polynomial expansion (TPE), thereby obtaining the new TPE precoding scheme, which is more suitable for real-time hardware implementation and significantly reduces the delay to the first transmitted symbol. The degree of the matrix polynomial can be adapted to the available hardware resources and enables a smooth transition between simple maximum ratio transmission and more advanced RZF. By deriving new random matrix results, we obtain a deterministic expression for the asymptotic signal-to-interference-and-noise ratio (SINR) achieved by TPE precoding in massive MIMO systems. Furthermore, we provide a closed-form expression for the polynomial coefficients that maximizes this SINR. To maintain a fixed per-user rate loss as compared to RZF, the polynomial degree does not need to scale with the system, but it should be increased with the quality of the channel knowledge and the signal-to-noise ratio.
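The structural trick is easy to show: approximate the RZF inverse by a short matrix polynomial so that precoding needs only matrix products. The sketch below uses plain Neumann-series coefficients as a stand-in for the paper's SINR-optimized ones; the dimensions, regularization and step size are arbitrary assumptions.

```python
import numpy as np

def tpe_precoder(H, delta, J, alpha):
    """Approximate H^H (H H^H + delta*I)^(-1) with J Neumann terms:
    inv(A) ~ alpha * sum_k (I - alpha*A)^k, valid for small enough alpha."""
    K = H.shape[0]
    A = H @ H.conj().T + delta * np.eye(K)
    acc, term = np.zeros_like(A), np.eye(K, dtype=A.dtype)
    for _ in range(J):
        acc = acc + term
        term = term @ (np.eye(K) - alpha * A)
    return H.conj().T @ (alpha * acc)

rng = np.random.default_rng(0)
K, M = 8, 64                                  # users, BS antennas
H = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)
delta = K / 10.0
rzf = H.conj().T @ np.linalg.inv(H @ H.conj().T + delta * np.eye(K))
tpe = tpe_precoder(H, delta, J=16, alpha=1.0 / (2 * M))
print(np.linalg.norm(rzf - tpe) / np.linalg.norm(rzf))   # shrinks as J grows
```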
NASA Astrophysics Data System (ADS)
Lin, Zhuosheng; Yu, Simin; Li, Chengqing; Lü, Jinhu; Wang, Qianxue
This paper proposes a chaotic secure video remote communication scheme that can perform on real WAN networks, and implements it on a smartphone hardware platform. First, a joint encryption and compression scheme is designed by embedding a chaotic encryption scheme into the MJPG-Streamer source code. Then, multiuser smartphone communications between the sender and the receiver are implemented via WAN remote transmission. Finally, the transmitted video data are received with the given IP address and port in an Android smartphone. It should be noted that this is the first time that chaotic video encryption schemes have been implemented on such a hardware platform. The experimental results demonstrate that the technical challenges of hardware implementation of secure video communication are successfully solved, reaching a balance amongst sufficient security level, real-time processing of massive video data, and utilization of available resources in the hardware environment. The proposed scheme can serve as a good application example of chaotic secure communications for smartphones and other mobile facilities in the future.
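To show the general shape of chaos-based stream encryption (a generic illustration only, not the authors' actual cipher, and the bare logistic map used here is cryptographically weak):

```python
def logistic_keystream(x0, r, nbytes):
    """Byte keystream from the logistic map x <- r*x*(1-x).
    Illustrative only; real designs use stronger chaotic systems."""
    x, out = x0, bytearray()
    for _ in range(nbytes):
        x = r * x * (1.0 - x)
        out.append(int(x * 256) & 0xFF)
    return bytes(out)

def xor_cipher(data, key):
    """XOR encryption/decryption (the operation is its own inverse)."""
    return bytes(d ^ k for d, k in zip(data, key))

frame = b"\xff\xd8...one MJPEG frame..."     # stand-in for real JPEG bytes
ks = logistic_keystream(x0=0.3456, r=3.99, nbytes=len(frame))
enc = xor_cipher(frame, ks)
print(xor_cipher(enc, ks) == frame)          # True: decrypts correctly
```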
2010-03-01
to a graphics card, and not the redesign of XML. The justification is that if XML is going to be prevalent, special optimized hardware is ... the answer, similar to the specialized functions of a video card. Given Moore's law that processing power doubles every few years, let the ... and numerous multimedia players such as iTunes from Apple. These applications are free to use, but the source is restricted by software licenses
Rapid-X - An FPGA Development Toolset Using a Custom Simulink Library for MTCA.4 Modules
NASA Astrophysics Data System (ADS)
Prędki, Paweł; Heuer, Michael; Butkowski, Łukasz; Przygoda, Konrad; Schlarb, Holger; Napieralski, Andrzej
2015-06-01
The recent introduction of advanced hardware architectures such as the Micro Telecommunications Computing Architecture (MTCA) caused a change in the approach to implementation of control schemes in many fields. The development has been moving away from traditional programming languages (C/C++) to hardware description languages (VHDL, Verilog), which are used in FPGA development. With MATLAB/Simulink it is possible to describe complex systems with block diagrams and simulate their behavior. Those diagrams are then used by the HDL experts to implement exactly the required functionality in hardware. Both the porting of existing applications and adaptation of new ones require a lot of development time from them. To solve this, Xilinx System Generator, a toolbox for MATLAB/Simulink, allows rapid prototyping of those block diagrams using hardware modelling. It is still up to the firmware developer to merge this structure with the hardware-dependent HDL project. This prevents the application engineer from quickly verifying the proposed schemes in real hardware. The framework described in this article overcomes these challenges, offering a hardware-independent library of components that can be used in Simulink/System Generator models. The components are subsequently translated into VHDL entities and integrated with a pre-prepared VHDL project template. Furthermore, the entire implementation process is run in the background, giving the user an almost one-click path from control scheme modelling and simulation to bit-file generation. This approach allows the application engineers to quickly develop new schemes and test them in real hardware environment. The applications may range from simple data logging or signal generation ones to very advanced controllers. Taking advantage of the Simulink simulation capabilities and user-friendly hardware implementation routines, the framework significantly decreases the development time of FPGA-based applications.
NASA Technical Reports Server (NTRS)
Lin, Shu; Fossorier, Marc
1998-01-01
The Viterbi algorithm is indeed a very simple and efficient method of implementing maximum likelihood decoding. However, if we take advantage of the structural properties of a trellis section, other efficient trellis-based decoding algorithms can be devised. Recently, an efficient trellis-based recursive maximum likelihood decoding (RMLD) algorithm for linear block codes has been proposed. This algorithm is more efficient than the conventional Viterbi algorithm in both computation and hardware requirements. Most importantly, the implementation of this algorithm does not require the construction of the entire code trellis; only some special one-section trellises of relatively small state and branch complexity are needed for constructing path (or branch) metric tables recursively. At the end, there is only one table, which contains only the most likely codeword and its metric for a given received sequence r = (r_1, r_2, ..., r_n). This algorithm basically uses the divide-and-conquer strategy. Furthermore, it allows parallel/pipeline processing of received sequences to speed up decoding.
The structure of the clouds distributed operating system
NASA Technical Reports Server (NTRS)
Dasgupta, Partha; Leblanc, Richard J., Jr.
1989-01-01
A novel system architecture, based on the object model, is the central structuring concept used in the Clouds distributed operating system. This architecture makes Clouds attractive over a wide class of machines and environments. Clouds is a native operating system, designed and implemented at Georgia Tech, and runs on a set of general-purpose computers connected via a local area network. The system architecture of Clouds is composed of a system-wide global set of persistent (long-lived) virtual address spaces, called objects, that contain persistent data and code. The object concept is implemented at the operating system level, thus presenting a single-level storage view to the user. Lightweight threads carry computational activity through the code stored in the objects. The persistent objects and threads give rise to a programming environment composed of shared permanent memory, dispensing with the need for hardware-derived concepts such as file systems and message systems. Though the hardware may be distributed and may have disks and networks, Clouds provides the applications with a logically centralized system, based on a shared, structured, single-level store. The current design of Clouds uses a minimalist philosophy with respect to both the kernel and the operating system. That is, the kernel and the operating system support a bare minimum of functionality. Clouds also adheres to the concept of separation of policy and mechanism. Most low-level operating system services are implemented above the kernel and most high-level services are implemented at the user level. From the measured performance of using the kernel mechanisms, we are able to demonstrate that efficient implementations of the object model are feasible on commercially available hardware. Clouds provides a rich environment for conducting research in distributed systems. Some of the topics addressed in this paper include distributed programming environments, consistency of persistent data and fault tolerance.
Bar Coding and Tracking in Pathology.
Hanna, Matthew G; Pantanowitz, Liron
2016-03-01
Bar coding and specimen tracking are intricately linked to pathology workflow and efficiency. In the pathology laboratory, bar coding facilitates many laboratory practices, including specimen tracking, automation, and quality management. Data obtained from bar coding can be used to identify, locate, standardize, and audit specimens to achieve maximal laboratory efficiency and patient safety. Variables that need to be considered when implementing and maintaining a bar coding and tracking system include assets to be labeled, bar code symbologies, hardware, software, workflow, and laboratory and information technology infrastructure as well as interoperability with the laboratory information system. This article addresses these issues, primarily focusing on surgical pathology. Copyright © 2016 Elsevier Inc. All rights reserved.
Structured Low-Density Parity-Check Codes with Bandwidth Efficient Modulation
NASA Technical Reports Server (NTRS)
Cheng, Michael K.; Divsalar, Dariush; Duy, Stephanie
2009-01-01
In this work, we study the performance of structured Low-Density Parity-Check (LDPC) Codes together with bandwidth efficient modulations. We consider protograph-based LDPC codes that facilitate high-speed hardware implementations and have minimum distances that grow linearly with block sizes. We cover various higher- order modulations such as 8-PSK, 16-APSK, and 16-QAM. During demodulation, a demapper transforms the received in-phase and quadrature samples into reliability information that feeds the binary LDPC decoder. We will compare various low-complexity demappers and provide simulation results for assorted coded-modulation combinations on the additive white Gaussian noise and independent Rayleigh fading channels.
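As an illustration of the demapper's job, here is a max-log LLR computation for a Gray-labeled 8-PSK constellation; the labeling, noise model and numbers are assumptions, and the paper's low-complexity demappers approximate this further.

```python
import numpy as np

def max_log_llrs(y, symbols, labels, sigma2):
    """Max-log demapper: for each bit position, LLR = (min dist^2 over
    symbols labeled 1 - min over symbols labeled 0) / sigma2, so a
    positive LLR favors bit 0."""
    d2 = np.abs(y - symbols) ** 2
    llrs = []
    for b in range(len(labels[0])):
        d0 = min(d2[i] for i, lab in enumerate(labels) if lab[b] == '0')
        d1 = min(d2[i] for i, lab in enumerate(labels) if lab[b] == '1')
        llrs.append((d1 - d0) / sigma2)
    return llrs

symbols = np.exp(1j * np.arange(8) * np.pi / 4)        # 8-PSK points
labels = ['000', '001', '011', '010', '110', '111', '101', '100']  # Gray
y = symbols[0] + 0.1 * (0.5 + 0.3j)                    # noisy sample near '000'
print(max_log_llrs(y, symbols, labels, sigma2=0.01))   # all positive: bits 000
```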
Emergent Auditory Feature Tuning in a Real-Time Neuromorphic VLSI System.
Sheik, Sadique; Coath, Martin; Indiveri, Giacomo; Denham, Susan L; Wennekers, Thomas; Chicca, Elisabetta
2012-01-01
Many sounds of ecological importance, such as communication calls, are characterized by time-varying spectra. However, most neuromorphic auditory models to date have focused on distinguishing mainly static patterns, under the assumption that dynamic patterns can be learned as sequences of static ones. In contrast, the emergence of dynamic feature sensitivity through exposure to formative stimuli has been recently modeled in a network of spiking neurons based on the thalamo-cortical architecture. The proposed network models the effect of lateral and recurrent connections between cortical layers, distance-dependent axonal transmission delays, and learning in the form of Spike Timing Dependent Plasticity (STDP), which effects stimulus-driven changes in the pattern of network connectivity. In this paper we demonstrate how these principles can be efficiently implemented in neuromorphic hardware. In doing so we address two principal problems in the design of neuromorphic systems: real-time event-based asynchronous communication in multi-chip systems, and the realization in hybrid analog/digital VLSI technology of neural computational principles that we propose underlie plasticity in neural processing of dynamic stimuli. The result is a hardware neural network that learns in real-time and shows preferential responses, after exposure, to stimuli exhibiting particular spectro-temporal patterns. The availability of hardware on which the model can be implemented makes this a significant step toward the development of adaptive, neurobiologically plausible, spike-based, artificial sensory systems.
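The STDP rule at the heart of the learning is itself simple; below is a textbook pair-based form (the chip's analog circuits approximate this exponential window, and the parameters are generic assumptions):

```python
import numpy as np

def stdp(dt, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for one spike pair; dt = t_post - t_pre (ms).
    Pre-before-post (dt > 0) potentiates, post-before-pre depresses."""
    if dt > 0:
        return a_plus * np.exp(-dt / tau)
    return -a_minus * np.exp(dt / tau)

def apply_stdp(w, pre_times, post_times):
    """Accumulate changes over all spike pairs, clipped to [0, 1]."""
    for t_pre in pre_times:
        for t_post in post_times:
            w += stdp(t_post - t_pre)
    return float(np.clip(w, 0.0, 1.0))

print(apply_stdp(0.5, pre_times=[10, 30], post_times=[12, 33]))  # LTP wins
```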
NASA Astrophysics Data System (ADS)
Luo, Lin-Bo; An, Sang-Woo; Wang, Chang-Shuai; Li, Ying-Chun; Chong, Jong-Wha
2012-09-01
Digital cameras usually decrease exposure time to capture motion-blur-free images. However, this operation will generate an under-exposed image with a low-budget complementary metal-oxide semiconductor image sensor (CIS). Conventional color correction algorithms can efficiently correct under-exposed images; however, they are generally not performed in real time and need at least one frame memory if implemented in hardware. The authors propose a real-time look-up table-based color correction method that corrects under-exposed images in hardware without using frame memory. The method utilizes histogram matching of two preview images, which are exposed for a long and a short time, respectively, to construct an improved look-up table (ILUT) and then corrects the captured under-exposed image in real time. Because the ILUT is calculated in real time before processing the captured image, this method does not require frame memory to buffer image data, and therefore can greatly reduce the cost of the CIS. The method supports not only single image capture but also bracketing to capture three images at a time. The proposed method was implemented in a hardware description language and verified on a field-programmable gate array with a 5 M CIS. Simulations show that the system can perform in real time at low cost and can correct the color of under-exposed images well.
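The ILUT construction rests on histogram matching between the short- and long-exposure previews. A minimal sketch of that step, assuming single-channel 8-bit images (the hardware pipeline and any per-channel handling are not reproduced here):

```python
import numpy as np

def build_ilut(short_img, long_img, levels=256):
    """Map the short exposure's histogram onto the long exposure's
    via CDF matching; returns a lookup table of size `levels`."""
    cdf_s = np.cumsum(np.bincount(short_img.ravel(), minlength=levels)).astype(float)
    cdf_l = np.cumsum(np.bincount(long_img.ravel(), minlength=levels)).astype(float)
    cdf_s /= cdf_s[-1]
    cdf_l /= cdf_l[-1]
    # For each gray level of the short exposure, find the long-exposure
    # level with the closest cumulative probability.
    return np.searchsorted(cdf_l, cdf_s).clip(0, levels - 1).astype(np.uint8)

# Usage on a captured frame (hypothetical array names):
# corrected = build_ilut(preview_short, preview_long)[captured_under_exposed]
```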
Independent component analysis algorithm FPGA design to perform real-time blind source separation
NASA Astrophysics Data System (ADS)
Meyer-Baese, Uwe; Odom, Crispin; Botella, Guillermo; Meyer-Baese, Anke
2015-05-01
The conditions that arise in the cocktail party problem prevail across many fields, creating a need for blind source separation (BSS). These fields include array processing, communications, medical signal processing, speech processing, wireless communication, audio, acoustics, and biomedical engineering. The concept of the cocktail party problem and BSS led to the development of Independent Component Analysis (ICA) algorithms. ICA proves useful for applications needing real-time signal processing. The goal of this research was to perform an extensive study of the ability and efficiency of ICA algorithms to perform blind source separation on mixed signals, both in software and in a hardware implementation on a Field Programmable Gate Array (FPGA). The Algebraic ICA (A-ICA), Fast ICA, and Equivariant Adaptive Separation via Independence (EASI) ICA algorithms were examined and compared. The best algorithm was taken to be the one requiring the least complexity and the fewest resources while effectively separating mixed sources; by this measure, the EASI algorithm was the best. The EASI ICA was then implemented on FPGA hardware to analyze its performance in real time.
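For reference, the EASI update of Cardoso and Laheld has a compact serial form. The sketch below shows one sample-by-sample step with a cubic nonlinearity, an illustrative choice rather than the exact configuration used in this study:

```python
import numpy as np

def easi_step(W, x, lam=1e-3):
    """One serial EASI update for a single mixed sample x (1-D array);
    W is the current separating matrix, lam the adaptation step."""
    y = W @ x
    g = y ** 3                      # odd nonlinearity for higher-order stats
    I = np.eye(len(y))
    # Relative-gradient update enforcing decorrelation and independence.
    W -= lam * ((np.outer(y, y) - I) + np.outer(g, y) - np.outer(y, g)) @ W
    return W
```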
A spiking neural network model of 3D perception for event-based neuromorphic stereo vision systems
Osswald, Marc; Ieng, Sio-Hoi; Benosman, Ryad; Indiveri, Giacomo
2017-01-01
Stereo vision is an important feature that enables machine vision systems to perceive their environment in 3D. While machine vision has spawned a variety of software algorithms to solve the stereo-correspondence problem, their implementation and integration in small, fast, and efficient hardware vision systems remains a difficult challenge. Recent advances made in neuromorphic engineering offer a possible solution to this problem, with the use of a new class of event-based vision sensors and neural processing devices inspired by the organizing principles of the brain. Here we propose a radically novel model that solves the stereo-correspondence problem with a spiking neural network that can be directly implemented with massively parallel, compact, low-latency and low-power neuromorphic engineering devices. We validate the model with experimental results, highlighting features that are in agreement with both computational neuroscience stereo vision theories and experimental findings. We demonstrate its features with a prototype neuromorphic hardware system and provide testable predictions on the role of spike-based representations and temporal dynamics in biological stereo vision processing systems. PMID:28079187
Large-scale virtual screening on public cloud resources with Apache Spark.
Capuccini, Marco; Ahmed, Laeeq; Schaal, Wesley; Laure, Erwin; Spjuth, Ola
2017-01-01
Structure-based virtual screening is an in-silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on the message passing interface, relying on low-failure-rate hardware and fast network connections. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, providing transparent scalability and fault tolerance at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark. We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, docking a publicly available target receptor against approximately 2.2 million compounds. The performance experiments show a good parallel efficiency (87%) when running in a public cloud environment. Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows for trying out our method on relatively small libraries first and then scaling to larger libraries. Our implementation is named Spark-VS and it is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).
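A conceptual PySpark sketch of the MapReduce pattern the abstract describes, where `dock_partition` and `fake_docking_score` are hypothetical stand-ins for the real docking-tool invocation (the actual pipeline is the Spark-VS code on GitHub):

```python
from pyspark import SparkContext

def fake_docking_score(mol):
    """Placeholder scorer; a real pipeline calls an external docking tool."""
    return float(len(mol) % 100)

def dock_partition(molecules):
    # Map step: score each molecule record in this partition and
    # yield (molecule_id, score) pairs.
    for mol in molecules:
        if mol.strip():
            yield (mol.split()[0], fake_docking_score(mol))

sc = SparkContext(appName="docking-screen")
library = sc.textFile("sdf_blocks/")            # virtual library, one record per line
scores = library.mapPartitions(dock_partition)  # parallel docking across the cluster
top = scores.takeOrdered(100, key=lambda kv: kv[1])  # reduce step: best (lowest) scores
```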
Implementation of the 2-D Wavelet Transform into FPGA for Image
NASA Astrophysics Data System (ADS)
León, M.; Barba, L.; Vargas, L.; Torres, C. O.
2011-01-01
This paper presents a hardware implementation of the two-dimensional discrete wavelet transform algorithm for FPGA, using the Daubechies filter family of order 2 (db2). The decomposition algorithm of this transform is designed and simulated with the hardware description language VHDL and is implemented in a programmable logic device (FPGA), a Xilinx XC3S1200E of the Spartan-3E family, taking advantage of the parallelism these devices offer and the processing speeds they can reach. The architecture is evaluated using input images of different sizes. This implementation is done with the aim of developing a future hardware image-encryption system using the wavelet transform for information security.
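As a software reference model for the transform the FPGA implements, a single db2 decomposition level can be checked against the PyWavelets package (assuming `pywt` is available; the hardware design itself is written in VHDL):

```python
import numpy as np
import pywt

img = np.random.rand(256, 256)            # stand-in input image
cA, (cH, cV, cD) = pywt.dwt2(img, 'db2')  # one 2-D decomposition level
# cA: approximation subband; cH/cV/cD: horizontal/vertical/diagonal details.
```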
The Evolution of Exercise Hardware on ISS: Past, Present, and Future
NASA Technical Reports Server (NTRS)
Buxton, R. E.; Kalogera, K. L.; Hanson, A. M.
2017-01-01
During 16 years in low-Earth orbit, the suite of exercise hardware aboard the International Space Station (ISS) has matured significantly. Today, the countermeasure system supports an array of physical-training protocols and serves as an extensive research platform. Future hardware designs are required to have smaller operational envelopes and must also mitigate known physiologic issues observed in long-duration spaceflight. Taking lessons learned from the long history of space exercise will be important to successful development and implementation of future, compact exercise hardware. The evolution of exercise hardware as deployed on the ISS has implications for future exercise hardware and operations. Key lessons learned from the early days of ISS have helped to: 1. Enhance hardware performance (increased speed and loads). 2. Mature software interfaces. 3. Compare inflight exercise workloads to pre-, in-, and post-flight musculoskeletal and aerobic conditions. 4. Improve exercise comfort. 5. Develop complementary hardware for research and operations. Current ISS exercise hardware includes both custom and commercial-off-the-shelf (COTS) hardware. Benefits and challenges to this approach have prepared engineering teams to take a hybrid approach when designing and implementing future exercise hardware. Significant effort has gone into consideration of hardware instrumentation and wearable devices that provide important data to monitor crew health and performance.
Generalized reconfigurable memristive dynamical system (MDS) for neuromorphic applications
Bavandpour, Mohammad; Soleimani, Hamid; Linares-Barranco, Bernabé; Abbott, Derek; Chua, Leon O.
2015-01-01
This study firstly presents (i) a novel general cellular mapping scheme for two dimensional neuromorphic dynamical systems such as bio-inspired neuron models, and (ii) an efficient mixed analog-digital circuit, which can be conveniently implemented on a hybrid memristor-crossbar/CMOS platform, for hardware implementation of the scheme. This approach employs 4n memristors and no switch for implementing an n-cell system in comparison with 2n² memristors and 2n switches of a Cellular Memristive Dynamical System (CMDS). Moreover, this approach allows for dynamical variables with both analog and one-hot digital values opening a wide range of choices for interconnections and networking schemes. Dynamical response analyses show that this circuit exhibits various responses based on the underlying bifurcation scenarios which determine the main characteristics of the neuromorphic dynamical systems. Due to high programmability of the circuit, it can be applied to a variety of learning systems, real-time applications, and analytically indescribable dynamical systems. We simulate the FitzHugh-Nagumo (FHN), Adaptive Exponential (AdEx) integrate and fire, and Izhikevich neuron models on our platform, and investigate the dynamical behaviors of these circuits as case studies. Moreover, error analysis shows that our approach is suitably accurate. We also develop a simple hardware prototype for experimental demonstration of our approach. PMID:26578867
CCC7-119 Reactive Molecular Dynamics Simulations of Hot Spot Growth in Shocked Energetic Materials
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thompson, Aidan P.
2015-03-01
The purpose of this work is to understand how defects control initiation in energetic materials used in stockpile components. Sequoia gives us the core count to run very large-scale simulations of up to 10 million atoms, and, using an OpenMP-threaded implementation of the ReaxFF package in LAMMPS, we have been able to get good parallel efficiency running on 16k nodes of Sequoia with one hardware thread per core.
Interactive wall turbulence control
NASA Technical Reports Server (NTRS)
Wilkinson, Stephen P.
1990-01-01
After presenting boundary-layer turbulence physics in a manner that emphasizes how structural surfaces might be modified to locally alter the production of turbulent flows, an account is given of the hardware that could plausibly be employed to implement such a turbulence-control scheme. The essential system components are flow sensors, electronic processors, and actuators; at present, actuator technology presents the greatest problems and limitations. High-frequency, high-efficiency actuators are required to handle three-dimensional turbulent motions whose frequency and intensity increase in approximate proportion to freestream speed.
Development of a demand assignment/TDMA system for international business satellite communications
NASA Astrophysics Data System (ADS)
Nohara, Mitsuo; Takeuchi, Yoshio; Takahata, Fumio; Hirata, Yasuo; Yamazaki, Yoshiharu
An experimental IBS (international business satellite) communications system based on demand assignment and TDMA (time-division multiple-access) operation has been developed. The system utilizes a limited satellite resource efficiently and provides various kinds of ISDN services. The IBS network configurations suitable for international communications are discussed, and the developed communications system is described from the viewpoint of its hardware and software implementation. The performance in terms of transmission quality and call processing is also demonstrated.
NASA Astrophysics Data System (ADS)
An, Fengwei; Akazawa, Toshinobu; Yamasaki, Shogo; Chen, Lei; Jürgen Mattausch, Hans
2015-04-01
This paper reports a VLSI realization of learning vector quantization (LVQ) with high flexibility for different applications. It is based on a hardware/software (HW/SW) co-design concept for on-chip learning and recognition and designed as a SoC in 180 nm CMOS. The time-consuming nearest-Euclidean-distance search in the LVQ algorithm's competition layer is efficiently implemented as a pipeline with parallel p-word input. Since the number of neurons in the competition layer, the weight values, and the numbers of inputs and outputs are scalable, the requirements of many different applications can be satisfied without hardware changes. Classification of a d-dimensional input vector is completed in n × ⌈d/p⌉ + R clock cycles, where R is the pipeline depth and n is the number of reference feature vectors (FVs). Adjustment of stored reference FVs during learning is done by the embedded 32-bit RISC CPU, because this operation is not time critical. The high flexibility is verified by the application of human detection with different numbers for the dimensionality of the FVs.
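A software sketch of the competition-layer search and the cycle-count formula, purely illustrative of the computation the pipeline accelerates (parameter values are examples, not those of the chip):

```python
import numpy as np

def lvq_classify(x, refs, labels):
    """Nearest-Euclidean-distance search of the LVQ competition layer,
    done serially here; the chip pipelines it with p-word parallelism."""
    d2 = ((refs - x) ** 2).sum(axis=1)   # squared distances to all references
    return labels[np.argmin(d2)]

# Cycle estimate from the abstract: n * ceil(d / p) + R
n, d, p, R = 256, 64, 8, 10              # illustrative sizes
cycles = n * -(-d // p) + R              # ceil via negative floor division
```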
Reconfigurable vision system for real-time applications
NASA Astrophysics Data System (ADS)
Torres-Huitzil, Cesar; Arias-Estrada, Miguel
2002-03-01
Recently, a growing community of researchers has used reconfigurable systems to solve computationally intensive problems. Reconfigurability provides optimized processors for systems-on-chip designs, and makes it easy to import technology to a new system through reusable modules. The main objective of this work is the investigation of a reconfigurable computer system targeted for computer vision and real-time applications. The system is intended to circumvent the inherent computational load of most window-based computer vision algorithms. It aims to build a system for such tasks by providing an FPGA-based hardware architecture for task-specific vision applications with enough processing power, using as few hardware resources as possible, and a mechanism for building systems using this architecture. Regarding the software part of the system, a library of pre-designed and general-purpose modules that implement common window-based computer vision operations is being investigated. A common generic interface is established for these modules in order to define hardware/software components. These components can be interconnected to develop more complex applications, providing an efficient mechanism for transferring image and result data among modules. Some preliminary results are presented and discussed.
Synthetic Foveal Imaging Technology
NASA Technical Reports Server (NTRS)
Nikzad, Shouleh (Inventor); Monacos, Steve P. (Inventor); Hoenk, Michael E. (Inventor)
2013-01-01
Apparatuses and methods are disclosed that create a synthetic fovea in order to identify and highlight interesting portions of an image for further processing and rapid response. Synthetic foveal imaging implements a parallel processing architecture that uses reprogrammable logic to implement embedded, distributed, real-time foveal image processing from different sensor types while simultaneously allowing for lossless storage and retrieval of raw image data. Real-time, distributed, adaptive processing of multi-tap image sensors with coordinated processing hardware used for each output tap is enabled. In mosaic focal planes, a parallel-processing network can be implemented that treats the mosaic focal plane as a single ensemble rather than a set of isolated sensors. Various applications are enabled for imaging and robotic vision where processing and responding to enormous amounts of data quickly and efficiently is important.
An emulator for minimizing finite element analysis implementation resources
NASA Technical Reports Server (NTRS)
Melosh, R. J.; Utku, S.; Salama, M.; Islam, M.
1982-01-01
A finite element analysis emulator providing a basis for efficiently establishing an optimum computer implementation strategy when many calculations are involved is described. The SCOPE emulator determines computer resources required as a function of the structural model, structural load-deflection equation characteristics, the storage allocation plan, and computer hardware capabilities. Thereby, it provides data for trading off analysis implementation options to arrive at a best strategy. The models contained in SCOPE lead to micro-operation computer counts of each finite element operation as well as overall computer resource cost estimates. Application of SCOPE to the Memphis-Arkansas bridge analysis provides measures of the accuracy of resource assessments. Data indicate that predictions are within 17.3 percent for calculation times and within 3.2 percent for peripheral storage resources for the ELAS code.
NASA Astrophysics Data System (ADS)
Pfeil, Thomas; Jordan, Jakob; Tetzlaff, Tom; Grübl, Andreas; Schemmel, Johannes; Diesmann, Markus; Meier, Karlheinz
2016-04-01
High-level brain function, such as memory, classification, or reasoning, can be realized by means of recurrent networks of simplified model neurons. Analog neuromorphic hardware constitutes a fast and energy-efficient substrate for the implementation of such neural computing architectures in technical applications and neuroscientific research. The functional performance of neural networks is often critically dependent on the level of correlations in the neural activity. In finite networks, correlations are typically inevitable due to shared presynaptic input. Recent theoretical studies have shown that inhibitory feedback, abundant in biological neural networks, can actively suppress these shared-input correlations and thereby enable neurons to fire nearly independently. For networks of spiking neurons, the decorrelating effect of inhibitory feedback has so far been explicitly demonstrated only for homogeneous networks of neurons with linear subthreshold dynamics. Theory, however, suggests that the effect is a general phenomenon, present in any system with sufficient inhibitory feedback, irrespective of the details of the network structure or the neuronal and synaptic properties. Here, we investigate the effect of network heterogeneity on correlations in sparse, random networks of inhibitory neurons with nonlinear, conductance-based synapses. Emulations of these networks on the analog neuromorphic-hardware system Spikey allow us to test the efficiency of decorrelation by inhibitory feedback in the presence of hardware-specific heterogeneities. The configurability of the hardware substrate enables us to modulate the extent of heterogeneity in a systematic manner. We selectively study the effects of shared input and recurrent connections on correlations in membrane potentials and spike trains. Our results confirm that shared-input correlations are actively suppressed by inhibitory feedback also in highly heterogeneous networks exhibiting broad, heavy-tailed firing-rate distributions. In line with former studies, cell heterogeneities reduce shared-input correlations. Overall, however, correlations in the recurrent system can increase with the level of heterogeneity as a consequence of diminished effective negative feedback.
Malleable architecture generator for FPGA computing
NASA Astrophysics Data System (ADS)
Gokhale, Maya; Kaba, James; Marks, Aaron; Kim, Jang
1996-10-01
The malleable architecture generator (MARGE) is a tool set that translates high-level parallel C to configuration bit streams for field-programmable logic based computing systems. MARGE creates an application-specific instruction set and generates the custom hardware components required to perform exactly those computations specified by the C program. In contrast to traditional fixed-instruction processors, MARGE's dynamic instruction set creation provides for efficient use of hardware resources. MARGE processes intermediate code in which each operation is annotated by the bit lengths of the operands. Each basic block (sequence of straight line code) is mapped into a single custom instruction which contains all the operations and logic inherent in the block. A synthesis phase maps the operations comprising the instructions into register transfer level structural components and control logic which have been optimized to exploit functional parallelism and function unit reuse. As a final stage, commercial technology-specific tools are used to generate configuration bit streams for the desired target hardware. Technology-specific pre-placed, pre-routed macro blocks are utilized to implement as much of the hardware as possible. MARGE currently supports the Xilinx-based Splash-2 reconfigurable accelerator and National Semiconductor's CLAy-based parallel accelerator, MAPA. The MARGE approach has been demonstrated on systolic applications such as DNA sequence comparison.
An Agent Inspired Reconfigurable Computing Implementation of a Genetic Algorithm
NASA Technical Reports Server (NTRS)
Weir, John M.; Wells, B. Earl
2003-01-01
Many software systems have been successfully implemented using an agent paradigm which employs a number of independent entities that communicate with one another to achieve a common goal. The distributed nature of such a paradigm makes it an excellent candidate for use in high speed reconfigurable computing hardware environments such as those present in modern FPGAs. In this paper, a distributed genetic algorithm that can be applied to the agent based reconfigurable hardware model is introduced. The effectiveness of this new algorithm is evaluated by comparing the quality of the solutions found by the new algorithm with those found by traditional genetic algorithms. The performance of a reconfigurable hardware implementation of the new algorithm on an FPGA is compared to traditional single processor implementations.
FPGA based hardware optimized implementation of signal processing system for LFM pulsed radar
NASA Astrophysics Data System (ADS)
Azim, Noor ul; Jun, Wang
2016-11-01
Signal processing is one of the main parts of any radar system. Different signal processing algorithms are used to extract information about parameters such as the range, speed, and direction of a target in the field of radar communication. This paper presents LFM (Linear Frequency Modulation) pulsed radar signal processing algorithms which are used to improve target detection and range resolution and to estimate the speed of a target. Firstly, these algorithms are simulated in MATLAB to verify the concept and theory. After the conceptual verification in MATLAB, the simulation is converted into a hardware implementation using a Xilinx FPGA, the Virtex-6 (XC6LVX75T). For the hardware implementation, pipeline optimization is adopted, and other factors are considered for resource optimization in the process of implementation. The algorithms in this work for improving target detection, range resolution, and speed estimation are hardware-optimized, fast-convolution-based pulse compression and pulse-Doppler processing.
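Pulse compression by fast convolution, the first of the two algorithms, reduces to FFT-based matched filtering against the LFM reference pulse. A minimal sketch with illustrative parameters, not those of the paper:

```python
import numpy as np

def pulse_compress(rx, ref):
    """FFT-based matched filtering (fast convolution) of a received
    block against the reference pulse; correlation peaks at target range."""
    n = len(rx) + len(ref) - 1
    RX = np.fft.fft(rx, n)
    REF = np.fft.fft(ref, n)
    return np.fft.ifft(RX * np.conj(REF))

# LFM reference: s(t) = exp(j*pi*(B/T)*t^2) over the pulse width T
fs, T, B = 1e6, 1e-4, 2e5                 # sample rate, pulse width, bandwidth
t = np.arange(0, T, 1 / fs)
ref = np.exp(1j * np.pi * (B / T) * t ** 2)
echo = np.concatenate([np.zeros(300), ref, np.zeros(200)])  # target at bin 300
profile = np.abs(pulse_compress(echo, ref))                 # peak near index 300
```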
Cosmic-ray discrimination capabilities of ΔE-E silicon nuclear telescopes using neural networks
NASA Astrophysics Data System (ADS)
Ambriola, M.; Bellotti, R.; Cafagna, F.; Castellano, M.; Ciacio, F.; Circella, M.; Marzo, C. N. D.; Montaruli, T.
2000-02-01
An isotope classifier of cosmic-ray events collected by space detectors has been implemented using a multi-layer perceptron neural architecture. In order to handle a great number of different isotopes, a modular architecture of the "mixture of experts" type is proposed. The performance of this classifier has been tested on simulated data and has been compared with a "classical" classifying procedure. The quantitative comparison with traditional techniques shows that the neural approach has classification performances comparable - within 1% - with those of the classical one, with efficiency of the order of 98%. A possible hardware implementation of such a neural architecture in future space missions is considered.
Blind detection of giant pulses: GPU implementation
NASA Astrophysics Data System (ADS)
Ait-Allal, Dalal; Weber, Rodolphe; Dumez-Viou, Cédric; Cognard, Ismael; Theureau, Gilles
2012-01-01
Radio astronomical pulsar observations require specific instrumentation and dedicated signal processing to cope with the dispersion caused by the interstellar medium. Moreover, the quality of observations can be limited by radio frequency interference (RFI) generated by telecommunications activity. This article presents the innovative pulsar instrumentation based on graphics processing units (GPUs) which has been designed at the Nançay Radio Astronomical Observatory. In addition, for giant pulse search, we propose a new approach which combines a hardware-efficient search method and some RFI mitigation capabilities. Although this approach is less sensitive than the classical approach, its advantage is that no a priori information on the pulsar parameters is required. The validation of a GPU implementation is under way.
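The dispersion removal such instrumentation performs can be sketched in its simplest (incoherent) form, shifting each frequency channel by the cold-plasma delay before summing; this is a generic illustration, not the Nançay GPU code:

```python
import numpy as np

K_DM = 4.148808e3  # dispersion constant, MHz^2 pc^-1 cm^3 s

def dedisperse(data, freqs_mhz, dm, dt):
    """data: (n_chan, n_samp) filterbank block; dt: sample time in s.
    Shifts each channel to undo the frequency-dependent delay
    (wraparound at block edges is ignored in this toy version)."""
    f_ref = freqs_mhz.max()
    delays = K_DM * dm * (freqs_mhz ** -2 - f_ref ** -2)  # seconds
    shifts = np.round(delays / dt).astype(int)
    out = np.zeros_like(data)
    for i, s in enumerate(shifts):
        out[i] = np.roll(data[i], -s)  # advance delayed channels
    return out.sum(axis=0)             # dedispersed time series
```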
Benchmarking of Computational Models for NDE and SHM of Composites
NASA Technical Reports Server (NTRS)
Wheeler, Kevin; Leckey, Cara; Hafiychuk, Vasyl; Juarez, Peter; Timucin, Dogan; Schuet, Stefan; Hafiychuk, Halyna
2016-01-01
Ultrasonic wave phenomena constitute the leading physical mechanism for nondestructive evaluation (NDE) and structural health monitoring (SHM) of solid composite materials such as carbon-fiber-reinforced polymer (CFRP) laminates. Computational models of ultrasonic guided-wave excitation, propagation, scattering, and detection in quasi-isotropic laminates can be extremely valuable in designing practically realizable NDE and SHM hardware and software with desired accuracy, reliability, efficiency, and coverage. This paper presents comparisons of guided-wave simulations for CFRP composites implemented using three different simulation codes: two commercial finite-element analysis packages, COMSOL and ABAQUS, and a custom code implementing the Elastodynamic Finite Integration Technique (EFIT). Comparisons are also made to experimental laser Doppler vibrometry data and theoretical dispersion curves.
Correia, J R C C C; Martins, C J A P
2017-10-01
Topological defects unavoidably form at symmetry breaking phase transitions in the early universe. To probe the parameter space of theoretical models and set tighter experimental constraints (exploiting the recent advances in astrophysical observations), one requires more and more demanding simulations, and therefore more hardware resources and computation time. Improving the speed and efficiency of existing codes is essential. Here we present a general purpose graphics-processing-unit implementation of the canonical Press-Ryden-Spergel algorithm for the evolution of cosmological domain wall networks. This is ported to the Open Computing Language standard, and as a consequence significant speedups are achieved both in two-dimensional (2D) and 3D simulations.
NASA Astrophysics Data System (ADS)
Machnes, Shai; Assémat, Elie; Tannor, David; Wilhelm, Frank
Quantum computation places very stringent demands on gate fidelities, and experimental implementations require both the controls and the resultant dynamics to conform to hardware-specific ansatzes and constraints. Superconducting qubits present the additional requirement that pulses have simple parametrizations, so they can be further calibrated in the experiment, to compensate for uncertainties in system characterization. We present a novel, conceptually simple and easy-to-implement gradient-based optimal control algorithm, GOAT, which satisfies all the above requirements. In part II we shall demonstrate the algorithm's capabilities, by using GOAT to optimize fast high-accuracy pulses for two leading superconducting qubits architectures - Xmons and IBM's flux-tunable couplers.
Implementation of COTS Hardware in Non-Critical Space Applications: A Brief Tutorial
NASA Technical Reports Server (NTRS)
Yoder, Geoffrey L.
2004-01-01
Approaches used for manned applications range from limited items, such as CD players evaluated for safety, to high-criticality applications where the COTS hardware is evaluated on a case-by-case basis for the application, with commensurate screening and qualification testing. COTS hardware is successfully implemented in both the International Space Station and the Space Shuttle but requires evaluation and modifications for the application. Screening and qualification of COTS hardware used in critical applications may need to be more extensive and stringent than traditional military screening. Evaluation covers: a) suitability for the application; b) safety; c) reliability and maintainability; and d) workmanship.
Applying a Genetic Algorithm to Reconfigurable Hardware
NASA Technical Reports Server (NTRS)
Wells, B. Earl; Weir, John; Trevino, Luis; Patrick, Clint; Steincamp, Jim
2004-01-01
This paper investigates the feasibility of applying genetic algorithms to solve optimization problems that are implemented entirely in reconfigurable hardware. The paper highlights the performance/design space trade-offs that must be understood to effectively implement a standard genetic algorithm within a modern Field Programmable Gate Array (FPGA) reconfigurable hardware environment and presents a case study where this stochastic search technique is applied to standard test-case problems taken from the technical literature. In this research, the targeted FPGA-based platform and high-level design environment was the Starbridge Hypercomputing platform, which incorporates multiple Xilinx Virtex II FPGAs, and the Viva graphical hardware description language.
Helix Project Testbed - Towards the Self-Regenerative Incorruptible Enterprise
2011-09-14
The project builds a secure architectural skeleton. This skeleton couples a critical slice of the low-level hardware implementation with a microkernel in a way that allows information-flow properties of the entire construction to be statically verified.
Finding a roadmap to achieve large neuromorphic hardware systems
Hasler, Jennifer; Marr, Bo
2013-01-01
Neuromorphic systems are gaining increasing importance in an era where CMOS digital computing techniques are reaching physical limits. These silicon systems mimic extremely energy efficient neural computing structures, potentially both for solving engineering applications as well as understanding neural computation. Toward this end, the authors provide a glimpse at what the technology evolution roadmap looks like for these systems so that Neuromorphic engineers may gain the same benefit of anticipation and foresight that IC designers gained from Moore's law many years ago. Scaling of energy efficiency, performance, and size will be discussed as well as how the implementation and application space of Neuromorphic systems are expected to evolve over time. PMID:24058330
Orżanowski, Tomasz
2016-01-01
This paper presents an infrared focal plane array (IRFPA) response nonuniformity correction (NUC) algorithm which is easy to implement in hardware. The proposed NUC algorithm is based on the linear correction scheme with a useful method for updating the pixel offset correction coefficients. The new approach to IRFPA response nonuniformity correction consists in using the pixel response change, determined at the actual operating conditions relative to the reference ones by means of a shutter, to compensate for the temporal drift of the pixel offset. Moreover, it removes any optics shading effect in the output image as well. To show the efficiency of the proposed NUC algorithm, some test results for a microbolometer IRFPA are presented.
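One plausible reading of the described scheme in software form, with the shutter frame driving the offset update (array names are hypothetical and the gain calibration is assumed given):

```python
import numpy as np

def nuc_correct(raw, gain, offset):
    """Linear (two-point style) nonuniformity correction per pixel."""
    return gain * (raw - offset)

def update_offset(offset_ref, shutter_now, shutter_ref):
    """Shutter-based offset update: fold the pixel response change at the
    current operating conditions, relative to the reference ones, into
    the stored offset coefficients, as the abstract describes."""
    return offset_ref + (shutter_now - shutter_ref)
```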
Newman, D M; Hawley, R W; Goeckel, D L; Crawford, R D; Abraham, S; Gallagher, N C
1993-05-10
An efficient storage format was developed for computer-generated holograms for use in electron-beam lithography. This method employs run-length encoding and Lempel-Ziv-Welch compression and succeeds in exposing holograms that were previously infeasible owing to the hologram's tremendous pattern-data file size. These holograms also require significant computation; thus the algorithm was implemented on a parallel computer, which improved performance by 2 orders of magnitude. The decompression algorithm was integrated into the Cambridge electron-beam machine's front-end processor. Although this provides much-needed capability, some hardware enhancements will be required in the future to overcome inadequacies in the current front-end processor that result in a lengthy exposure time.
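The run-length stage of such a storage format is straightforward to sketch; the LZW stage that follows it in the described format is omitted here for brevity, and the pair encoding shown is illustrative rather than the actual file layout:

```python
def rle_encode(data: bytes):
    """Minimal run-length encoder producing (count, value) pairs,
    with runs capped at 255 so counts fit in one byte."""
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        out.append((j - i, data[i]))
        i = j
    return out

assert rle_encode(b"\x00\x00\x00\xff") == [(3, 0), (1, 255)]
```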
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bobrek, Miljko; Albright, Austin P
This paper presents an FPGA implementation of a Reed-Solomon decoder for use in IEEE 802.16 WiMAX systems. The decoder is based on the RS(255,239) code, and is additionally shortened and punctured according to the WiMAX specifications. A Simulink model based on the Xilinx System Generator (Sysgen) block library was used for simulation and hardware implementation. Finally, simulation results and hardware implementation performance are presented.
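For a quick software cross-check of the code parameters, a shortened RS(255,239) codec can be exercised with the third-party `reedsolo` package, assuming it is installed; 16 parity symbols over GF(256) give exactly the n - k = 255 - 239 named above, and the decode return convention shown is that of recent package versions:

```python
from reedsolo import RSCodec

rsc = RSCodec(16)                      # 16 parity bytes -> RS(255,239)
msg = bytes(range(100))                # shortened message (< 239 bytes)
codeword = rsc.encode(msg)
corrupted = bytearray(codeword)
corrupted[5] ^= 0xFF                   # inject a single byte error
decoded = rsc.decode(bytes(corrupted))[0]  # recent versions return (msg, full, errata)
assert decoded == msg
```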
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
NASA Astrophysics Data System (ADS)
Mawson, Mark J.; Revell, Alistair J.
2014-10-01
The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third-generation nVidia GPU hardware, also known as 'Kepler'. We provide a review of previous optimization strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming) involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options, two of which use 'performance enhancing' features of the GPU: shared memory and the new shuffle instruction found in Kepler-based GPUs. These are compared to a standard transfer of data which relies instead on optimized storage to increase coalesced access. It is shown that the simpler approach is the most efficient; since the need for large numbers of registers per thread in LBM limits the block size, the efficiency of these special features is reduced. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration time occur with a frequency of around 11 Hz for both GPUs, independent of the size of the problem.
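The streaming step at the heart of the optimization study is, in the abstract, nothing more than per-direction data shifts. A D2Q9 sketch for brevity (the paper's solver is D3Q19 in CUDA; this version only illustrates the data movement):

```python
import numpy as np

# D2Q9 lattice velocities: rest, axis-aligned, and diagonal directions.
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])

def stream(f):
    """f: (9, nx, ny) distribution functions; streaming is a periodic
    shift of each population along its lattice velocity."""
    for i, (cx, cy) in enumerate(c):
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f
```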
Industry survey of space system cost benefits from New Ways Of Doing Business
NASA Technical Reports Server (NTRS)
Rosmait, Russell L.
1992-01-01
The cost of designing, building, and operating space system hardware has always been high. Small quantities of specialty parts escalate engineering design, production, and operations costs. Funding cutbacks and shrinking revenues dictate aggressive cost saving programs. NASA's highest priority is providing economical transportation to and from space. Over the past three decades NASA has seen technological advances that provide greater efficiencies in designing, building, and operating space system hardware. As future programs such as NLS, LUTE, and SEI begin, these greater efficiencies and cost savings should be reflected in the cost models. There are several New Ways Of Doing Business (NWODB) which, when fully implemented, will reduce space system costs. These philosophies and/or culture changes are integrated in five areas: (1) More Extensive Pre-Phase C/D & E, (2) Multi Year Funding Stability, (3) Improved Quality, Management and Procurement Processes, (4) Advanced Design Methods, and (5) Advanced Production Methods. Following is an overview of NWODB and the Cost Quantification Analysis results using an industry survey, one of the four quantification techniques used in the study. The NWODB Cost Quantification Analysis is a study performed at Marshall Space Flight Center by the Engineering Cost Group, Applied Research Incorporated, and Pittsburg State University. This study took place over a period of four months in mid 1992. The purpose of the study was to identify potential NWODB which could lead to improved cost effectiveness within NASA and to quantify potential cost benefits that might accrue if these NWODB were implemented.
Large-N correlator systems for low frequency radio astronomy
NASA Astrophysics Data System (ADS)
Foster, Griffin
Low frequency radio astronomy has entered a second golden age driven by the development of a new class of large-N interferometric arrays. The low frequency array (LOFAR) and a number of redshifted HI Epoch of Reionization (EoR) arrays are currently undergoing commissioning and are regularly observing. Future arrays of unprecedented sensitivity and resolution at low frequencies, such as the square kilometer array (SKA) and the hydrogen epoch of reionization array (HERA), are in development. The combination of advancements in specialized field programmable gate array (FPGA) hardware for signal processing, computing and graphics processing unit (GPU) resources, and new imaging and calibration algorithms has opened up the oft-underused radio band below 300 MHz. These interferometric arrays require efficient implementation of digital signal processing (DSP) hardware to compute the baseline correlations. FPGA technology provides an optimal platform to develop new correlators. The significant growth in data rates from these systems requires automated software to reduce the correlations in real time before storing the data products to disk. Low frequency, widefield observations introduce a number of unique calibration and imaging challenges. The efficient implementation of FX correlators using FPGA hardware is presented. Two correlators have been developed, one for the 32-element BEST-2 array at Medicina Observatory and the other for the 96-element LOFAR station at Chilbolton Observatory. In addition, calibration and imaging software has been developed for each system which makes use of the radio interferometry measurement equation (RIME) to derive calibrations. A process for generating sky maps from widefield LOFAR station observations is presented. Shapelets, a method of modelling extended structures such as resolved sources and beam patterns, has been adapted for radio astronomy use to further improve system calibration. Scaling of computing technology allows for the development of larger correlator systems, which in turn allows for improvements in sensitivity and resolution. This requires new calibration techniques which account for a broad range of systematic effects.
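An FX correlator's two stages can be sketched compactly: channelize each antenna stream with an FFT (F stage), then cross-multiply and accumulate over all antenna pairs (X stage). A toy version, nowhere near the scale of the FPGA designs described:

```python
import numpy as np

def fx_correlate(voltages, nchan=64):
    """voltages: (n_ant, n_samp) sampled streams, n_samp a multiple of nchan.
    Returns time-averaged cross-power spectra (visibilities) per baseline."""
    n_ant, n_samp = voltages.shape
    # F stage: split each stream into blocks and channelize with an FFT.
    spectra = np.fft.fft(voltages.reshape(n_ant, -1, nchan), axis=2)
    vis = {}
    for i in range(n_ant):
        for j in range(i, n_ant):
            # X stage: cross-multiply and accumulate over time blocks.
            vis[(i, j)] = (spectra[i] * np.conj(spectra[j])).mean(axis=0)
    return vis
```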
NASA Astrophysics Data System (ADS)
Elkurdi, Yousef; Fernández, David; Souleimanov, Evgueni; Giannacopoulos, Dennis; Gross, Warren J.
2008-04-01
The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool that has diverse applications ranging from structural engineering to electromagnetic simulation. The trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), hence interest has grown in the scientific community to exploit this technology. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGA's computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme which enables it to process arbitrarily large matrices without changing the number of PEs in the architecture. Therefore, this architecture is only limited by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz, obtaining a peak performance of 1.76 GFLOPS. For 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory.
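As a software reference for the computation the PE pipeline streams through, sparse matrix-vector multiplication in CSR storage (the paper's striping and partitioning schemes are not modeled here):

```python
import numpy as np

def csr_smvm(vals, col_idx, row_ptr, x):
    """y = A @ x with A in compressed sparse row (CSR) form:
    vals/col_idx hold nonzeros, row_ptr delimits each row's slice."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += vals[k] * x[col_idx[k]]
    return y

# 2x2 example: [[4, 0], [1, 2]] @ [1, 1] -> [4, 3]
assert list(csr_smvm([4, 1, 2], [0, 0, 1], [0, 1, 3], np.ones(2))) == [4.0, 3.0]
```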
Macyszyn, Luke; Lega, Brad; Bohman, Leif-Erik; Latefi, Ahmad; Smith, Michelle J; Malhotra, Neil R; Welch, William; Grady, Sean M
2013-09-01
Digital radiology enhances productivity and results in long-term cost savings. However, the viewing, storage, and sharing of outside imaging studies on compact discs at ambulatory offices and hospitals pose a number of unique challenges to a surgeon's efficiency and clinical workflow. Our objective was to improve the efficiency and clinical workflow of an academic neurosurgical practice when evaluating patients with outside radiological studies. Open-source software and commercial hardware were used to design and implement a departmental picture archiving and communications system (PACS). The implementation of a departmental PACS system significantly improved productivity and enhanced collaboration in a variety of clinical settings. Using published data on the rate of information technology problems associated with outside studies on compact discs, this system produced cost savings ranging from $6,250 to $33,600 and from $43,200 to $72,000 for 2 cohorts, urgent transfer and spine clinic patients, respectively, therefore justifying the costs of the system in less than a year. The implementation of a departmental PACS system using open-source software is straightforward and cost-effective and results in significant gains in surgeon productivity when evaluating patients with outside imaging studies.
Design Report for Low Power Acoustic Detector
2013-08-01
Describes the very high speed integrated circuit (VHSIC) hardware description language (VHDL) implementation of both the HED and DCD detectors, covering the hardware design, the target detection algorithm design in both MATLAB and VHDL, and typical performance results.
FPGA implementation of self organizing map with digital phase locked loops.
Hikawa, Hiroomi
2005-01-01
The self-organizing map (SOM) has found applicability in a wide range of application areas. Recently, new SOM hardware using phase-modulated pulse signals and digital phase-locked loops (DPLLs) has been proposed (Hikawa, 2005). The system uses the DPLL as a computing element since the operation of the DPLL is very similar to that of the SOM's computation. The system also uses the square waveform phase to hold the value of each input vector element. This paper discusses the hardware implementation of the DPLL SOM architecture. For effective hardware implementation, some components are redesigned to reduce the circuit size. The proposed SOM architecture is described in VHDL and implemented on a field programmable gate array (FPGA). Its feasibility is verified by experiments. Results show that the proposed SOM implemented on the FPGA has good quantization capability and that its circuit size is very small.
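The SOM computation that the DPLL array mimics has a simple software form. A sketch of a single training step on a one-dimensional neuron grid (learning-rate and neighborhood values are illustrative):

```python
import numpy as np

def som_step(weights, x, lr=0.1, sigma=1.0):
    """One SOM update: find the best-matching unit (BMU) by Euclidean
    distance, then pull neighboring units toward the input x."""
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
    dist = np.abs(np.arange(len(weights)) - bmu)   # grid distance to BMU
    h = np.exp(-(dist ** 2) / (2 * sigma ** 2))    # Gaussian neighborhood
    weights += lr * h[:, None] * (x - weights)
    return weights
```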
International Space Station alpha remote manipulator system workstation controls test report
NASA Astrophysics Data System (ADS)
Ehrenstrom, William A.; Swaney, Colin; Forrester, Patrick
1994-05-01
Previous development testing for the space station remote manipulator system workstation controls determined the need for hardware controls for the emergency stop, brakes on/off, and some camera functions. This report documents the results of an evaluation to further determine control implementation requirements, requested by the Canadian Space Agency (CSA), to close outstanding review item discrepancies. This test was conducted at the Johnson Space Center's Space Station Mockup and Trainer Facility in Houston, Texas, with nine NASA astronauts and one CSA astronaut as operators. This test evaluated camera iris and focus, back-up drive, latching end effector release, and autosequence controls using several types of hardware and software implementations. Recommendations resulting from the testing included providing guarded hardware buttons to prevent accidental actuation, providing autosequence controls and back-up drive controls on a dedicated hardware control panel, and that 'latch on/latch off', or on-screen software, controls not be considered. Generally, the operators preferred hardware controls although other control implementations were acceptable. The results of this evaluation will be used along with further testing to define specific requirements for the workstation design.
Fast Image Texture Classification Using Decision Trees
NASA Technical Reports Server (NTRS)
Thompson, David R.
2011-01-01
Texture analysis would permit improved autonomous, onboard science data interpretation for adaptive navigation, sampling, and downlink decisions. These analyses would assist with terrain analysis and instrument placement in both macroscopic and microscopic image data products. Unfortunately, most state-of-the-art texture analysis demands computationally expensive convolutions of filters involving many floating-point operations. This makes them infeasible for radiation-hardened computers and spaceflight hardware. A new method approximates traditional texture classification of each image pixel with a fast decision-tree classifier. The classifier uses image features derived from simple filtering operations involving integer arithmetic. The texture analysis method is therefore amenable to implementation on FPGA (field-programmable gate array) hardware. Image features based on the "integral image" transform produce descriptive and efficient texture descriptors. Training the decision tree on a set of training data yields a classification scheme that produces reasonable approximations of optimal "texton" analysis at a fraction of the computational cost. A decision-tree learning algorithm employing the traditional k-means criterion of inter-cluster variance is used to learn tree structure from training data. The result is an efficient and accurate summary of surface morphology in images. This work is an evolutionary advance that unites several previous algorithms (k-means clustering, integral images, decision trees) and applies them to a new problem domain (morphology analysis for autonomous science during remote exploration). Advantages include order-of-magnitude improvements in runtime, feasibility for FPGA hardware, and significant improvements in texture classification accuracy.
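The integral-image features underpinning such a classifier take constant time per box sum after one cumulative pass over the image. A minimal sketch of the transform and a box-sum lookup:

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; padded so box sums need no edge cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in four lookups -- integer-only features
    of this kind are what feed a decision-tree texture classifier."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16).reshape(4, 4)
assert box_sum(integral_image(img), 1, 1, 3, 3) == img[1:3, 1:3].sum()
```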
Real-time visual tracking of less textured three-dimensional objects on mobile platforms
NASA Astrophysics Data System (ADS)
Seo, Byung-Kuk; Park, Jungsik; Park, Hanhoon; Park, Jong-Il
2012-12-01
Natural feature-based approaches are still challenging for mobile applications (e.g., mobile augmented reality), because they are feasible only in limited environments such as highly textured and planar scenes/objects, and they need powerful mobile hardware for fast and reliable tracking. In many cases where conventional approaches are not effective, three-dimensional (3-D) knowledge of target scenes would be beneficial. We present a well-established framework for real-time visual tracking of less textured 3-D objects on mobile platforms. Our framework is based on model-based tracking that efficiently exploits partially known 3-D scene knowledge such as object models and a background's distinctive geometric or photometric knowledge. Moreover, we elaborate on implementation in order to make it suitable for real-time vision processing on mobile hardware. The performance of the framework is tested and evaluated on recent commercially available smartphones, and its feasibility is shown by real-time demonstrations.
Software control and system configuration management - A process that works
NASA Technical Reports Server (NTRS)
Petersen, K. L.; Flores, C., Jr.
1983-01-01
A comprehensive software control and system configuration management process for flight-crucial digital control systems of advanced aircraft has been developed and refined to insure efficient flight system development and safe flight operations. Because of the highly complex interactions among the hardware, software, and system elements of state-of-the-art digital flight control system designs, a systems-wide approach to configuration control and management has been used. Specific procedures are implemented to govern discrepancy reporting and reconciliation, software and hardware change control, systems verification and validation testing, and formal documentation requirements. An active and knowledgeable configuration control board reviews and approves all flight system configuration modifications and revalidation tests. This flexible process has proved effective during the development and flight testing of several research aircraft and remotely piloted research vehicles with digital flight control systems that ranged from relatively simple to highly complex, integrated mechanizations.
Matrix-vector multiplication using digital partitioning for more accurate optical computing
NASA Technical Reports Server (NTRS)
Gary, C. K.
1992-01-01
Digital partitioning offers a flexible means of increasing the accuracy of an optical matrix-vector processor. This algorithm can be implemented with the same architecture required for a purely analog processor, which gives optical matrix-vector processors the ability to perform high-accuracy calculations at speeds comparable with or greater than electronic computers as well as the ability to perform analog operations at a much greater speed. Digital partitioning is compared with digital multiplication by analog convolution, residue number systems, and redundant number representation in terms of the size and the speed required for an equivalent throughput as well as in terms of the hardware requirements. Digital partitioning and digital multiplication by analog convolution are found to be the most efficient algorithms if coding time and hardware are considered, and the architecture for digital partitioning permits the use of analog computations to provide the greatest throughput for a single processor.
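The partitioning idea can be checked numerically: split operands into low-precision digits (as an analog optical multiplier would see them), form digit-level partial products, and recombine with power-of-base shifts. Parameters here are illustrative, not the paper's:

```python
import numpy as np

def partitioned_dot(a, b, base=16, digits=4):
    """Dot product via digit partitioning: each digit-level partial
    product is a low-precision (analog-friendly) operation; the final
    shift-and-add recombination restores full precision."""
    da = [(a // base**i) % base for i in range(digits)]
    db = [(b // base**i) % base for i in range(digits)]
    total = 0
    for i in range(digits):
        for j in range(digits):
            total += (da[i] * db[j]).sum() * base**(i + j)  # shift & add
    return total

a = np.array([1234, 5678]); b = np.array([4321, 8765])
assert partitioned_dot(a, b) == int(np.dot(a, b))
```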
Tape SCSI monitoring and encryption at CERN
NASA Astrophysics Data System (ADS)
Laskaridis, Stefanos; Bahyl, V.; Cano, E.; Leduc, J.; Murray, S.; Cancio, G.; Kruse, D.
2017-10-01
CERN currently manages the largest data archive in the HEP domain; over 180 PB of custodial data is archived across 7 enterprise tape libraries containing more than 25,000 tapes and using over 100 tape drives. Archival storage at this scale requires a leading-edge monitoring infrastructure that acquires live and lifelong metrics from the hardware in order to assess and proactively identify potential drive- and media-level issues. In addition, protecting the privacy of sensitive archival data is becoming increasingly important, and with it the need for a scalable, compute-efficient and cost-effective solution for data encryption. In this paper, we first describe the implementation of acquiring tape medium and drive related metrics reported by the SCSI interface and its integration with our monitoring system. We then address the incorporation of tape drive real-time encryption with dedicated drive hardware into the CASTOR [1] hierarchical mass storage system.
Software control and system configuration management: A systems-wide approach
NASA Technical Reports Server (NTRS)
Petersen, K. L.; Flores, C., Jr.
1984-01-01
A comprehensive software control and system configuration management process for flight-crucial digital control systems of advanced aircraft has been developed and refined to ensure efficient flight system development and safe flight operations. Because of the highly complex interactions among the hardware, software, and system elements of state-of-the-art digital flight control system designs, a systems-wide approach to configuration control and management has been used. Specific procedures are implemented to govern discrepancy reporting and reconciliation, software and hardware change control, systems verification and validation testing, and formal documentation requirements. An active and knowledgeable configuration control board reviews and approves all flight system configuration modifications and revalidation tests. This flexible process has proved effective during the development and flight testing of several research aircraft and remotely piloted research vehicles with digital flight control systems that ranged from relatively simple to highly complex, integrated mechanizations.
Many-core graph analytics using accelerated sparse linear algebra routines
NASA Astrophysics Data System (ADS)
Kozacik, Stephen; Paolini, Aaron L.; Fox, Paul; Kelmelis, Eric
2016-05-01
Graph analytics is a key component in identifying emerging trends and threats in many real-world applications. Large-scale graph analytics frameworks provide a convenient and highly-scalable platform for developing algorithms to analyze large datasets. Although conceptually scalable, these techniques exhibit poor performance on modern computational hardware. Another model of graph computation has emerged that promises improved performance and scalability by using abstract linear algebra operations as the basis for graph analysis, as laid out by the GraphBLAS standard. By using sparse linear algebra as the basis, existing highly efficient algorithms can be adapted to perform computations on the graph. This approach, however, is often less intuitive to graph analytics experts, who are accustomed to vertex-centric APIs such as Giraph, GraphX, and Tinkerpop. We are developing an implementation of the high-level operations supported by these APIs in terms of linear algebra operations. This implementation is backed by many-core implementations of the fundamental GraphBLAS operations required, and offers the advantages of both the intuitive programming model of a vertex-centric API and the performance of a sparse linear algebra implementation. This technology can reduce the number of nodes required, as well as the run-time for a graph analysis problem, enabling customers to perform more complex analysis with less hardware at lower cost. All of this can be accomplished without the requirement for the customer to make any changes to their analytics code, thanks to the compatibility with existing graph APIs.
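As an illustration of the linear-algebra formulation of graph analytics (a sketch of the general GraphBLAS idea, not this group's many-core backend), breadth-first search can be expressed as repeated sparse matrix-vector products over a masked frontier:

```python
import numpy as np
import scipy.sparse as sp

def bfs_levels(adj: sp.csr_matrix, source: int) -> np.ndarray:
    """BFS written as sparse matvecs, in the spirit of GraphBLAS."""
    n = adj.shape[0]
    level = np.full(n, -1, dtype=np.int64)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    depth = 0
    while frontier.any():
        level[frontier] = depth
        # One matvec advances the frontier; masking out visited vertices
        # plays the role of the GraphBLAS complement mask.
        frontier = (adj.T @ frontier).astype(bool) & (level == -1)
        depth += 1
    return level

# Toy directed graph: 0 -> 1 -> 2, 0 -> 2, 2 -> 3
rows, cols = [0, 1, 0, 2], [1, 2, 2, 3]
adj = sp.csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))
print(bfs_levels(adj, 0))  # [0 1 1 2]
```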
NASA Technical Reports Server (NTRS)
Schoenwald, Adam J.; Bradley, Damon C.; Mohammed, Priscilla N.; Piepmeier, Jeffrey R.; Wong, Mark
2016-01-01
Radio-frequency interference (RFI) is a known problem for passive remote sensing, as evidenced in the L-band radiometers SMOS, Aquarius and, more recently, SMAP. Various algorithms have been developed and implemented on SMAP to improve science measurements. This was achieved by the use of a digital microwave radiometer. RFI mitigation becomes more challenging for microwave radiometers operating at higher frequencies in shared allocations. At higher frequencies, larger bandwidths are also desirable for lower measurement noise, further adding to processing challenges. This work focuses on finding improved RFI mitigation techniques that will be effective at additional frequencies and at higher bandwidths. To aid the development and testing of applicable detection and mitigation techniques, a wide-band RFI algorithm testing environment has been developed using the Reconfigurable Open Architecture Computing Hardware System (ROACH) built by the Collaboration for Astronomy Signal Processing and Electronics Research (CASPER) Group. The testing environment also consists of various test equipment used to reproduce typical signals that a radiometer may see, including those with and without RFI. The testing environment permits quick evaluation of RFI mitigation algorithms and shows that they are implementable in hardware. The algorithm implemented is a complex signal kurtosis detector, which was modeled and simulated. The complex signal kurtosis detector showed improved performance over the real kurtosis detector under certain conditions. The real kurtosis is implemented on SMAP at 24 MHz bandwidth. The complex signal kurtosis algorithm was then implemented in hardware at 200 MHz bandwidth using the ROACH. In this work, performance of the complex signal kurtosis and the real signal kurtosis are compared. Performance evaluations and comparisons in both simulation and experimental hardware implementations were done with the use of receiver operating characteristic (ROC) curves.
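The kurtosis statistic itself is simple to state: for circular complex Gaussian (thermal) samples the ratio E[|z|^4]/(E[|z|^2])^2 equals 2, and departures from that value indicate RFI. The block size, thresholds, and interferer below are illustrative assumptions, not SMAP or ROACH parameters:

```python
import numpy as np

def complex_kurtosis(z: np.ndarray) -> float:
    """Sample kurtosis of a complex block: E|z|^4 / (E|z|^2)^2,
    which is 2.0 for circular complex Gaussian noise."""
    p = np.abs(z) ** 2
    return np.mean(p ** 2) / np.mean(p) ** 2

def rfi_flags(z, blocks=16, lo=1.8, hi=2.2):
    """Flag blocks whose kurtosis leaves the Gaussian acceptance band
    (block count and thresholds are illustrative only)."""
    return np.array([not (lo <= complex_kurtosis(b) <= hi)
                     for b in np.array_split(z, blocks)])

rng = np.random.default_rng(0)
z = (rng.standard_normal(65536) + 1j * rng.standard_normal(65536)) / np.sqrt(2)
z[:4096] += 2.0 * np.exp(2j * np.pi * 0.1 * np.arange(4096))  # strong CW burst
print(rfi_flags(z).nonzero()[0])  # -> [0]: only the contaminated block
```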
FPGA-Based Stochastic Echo State Networks for Time-Series Forecasting.
Alomar, Miquel L; Canals, Vincent; Perez-Mora, Nicolas; Martínez-Moll, Víctor; Rosselló, Josep L
2016-01-01
Hardware implementation of artificial neural networks (ANNs) allows exploiting the inherent parallelism of these systems. Nevertheless, they require a large amount of resources in terms of area and power dissipation. Recently, Reservoir Computing (RC) has arisen as a strategic technique to design recurrent neural networks (RNNs) with simple learning capabilities. In this work, we show a new approach to implement RC systems with digital gates. The proposed method is based on the use of probabilistic computing concepts to reduce the hardware required to implement different arithmetic operations. The result is the development of a highly functional system with low hardware resources. The presented methodology is applied to chaotic time-series forecasting.
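For readers unfamiliar with reservoir computing, the sketch below is a plain floating-point echo state network applied to one-step forecasting of a chaotic series; it illustrates the fixed random reservoir and trained linear readout, not the stochastic-bitstream hardware the paper describes. Reservoir size, spectral radius, and ridge parameter are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_reservoir(n=200, rho=0.9, density=0.1):
    """Random sparse reservoir rescaled to spectral radius rho."""
    W = rng.standard_normal((n, n)) * (rng.random((n, n)) < density)
    W *= rho / np.abs(np.linalg.eigvals(W)).max()
    return W

def run_esn(u, n=200, washout=100):
    """Drive the reservoir with input u, then fit a ridge-regression
    readout to predict the next sample (one-step forecasting)."""
    W, w_in = make_reservoir(n), rng.standard_normal(n)
    x, states = np.zeros(n), []
    for u_t in u[:-1]:
        x = np.tanh(W @ x + w_in * u_t)
        states.append(x.copy())
    X = np.array(states[washout:])
    y = u[washout + 1:]
    w_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n), X.T @ y)
    return X @ w_out, y  # predictions vs. targets

# Chaotic benchmark: logistic map time series.
u = np.empty(1000); u[0] = 0.3
for t in range(999):
    u[t + 1] = 3.9 * u[t] * (1 - u[t])
pred, target = run_esn(u)
print(np.sqrt(np.mean((pred - target) ** 2)))  # small one-step error
```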
Language Classification using N-grams Accelerated by FPGA-based Bloom Filters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jacob, A; Gokhale, M
N-gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (the HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x the speed of comparable software and 1.45x that of the competing hardware design.
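A software analogue of the design's on-chip lookup may clarify the approach: one Bloom filter per language stores that language's frequent n-grams, and classification counts membership hits. Filter size, hash construction, and n-gram length below are illustrative assumptions, not the XD1000 parameters:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter mirroring the on-chip-RAM lookup tables
    (sizes and hash scheme are illustrative)."""
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)
    def _positions(self, item: str):
        for i in range(self.n_hashes):
            # Personalized BLAKE2b gives k independent hash functions.
            h = hashlib.blake2b(item.encode(), person=bytes([i]) * 16).digest()
            yield int.from_bytes(h[:8], "little") % self.n_bits
    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(text, n=4):
    return (text[i:i + n] for i in range(len(text) - n + 1))

def classify(doc, filters):
    """Score a document against one Bloom filter per language by
    counting n-gram hits; the highest-scoring language wins."""
    return max(filters, key=lambda lang: sum(g in filters[lang] for g in ngrams(doc)))

# Training on tiny sample corpora (illustrative).
filters = {"en": BloomFilter(), "de": BloomFilter()}
for g in ngrams("the quick brown fox jumps over the lazy dog"): filters["en"].add(g)
for g in ngrams("der schnelle braune fuchs springt ueber den faulen hund"): filters["de"].add(g)
print(classify("the fox jumps", filters))  # -> 'en'
```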
Area-delay trade-offs of texture decompressors for a graphics processing unit
NASA Astrophysics Data System (ADS)
Novoa Súñer, Emilio; Ituero, Pablo; López-Vallejo, Marisa
2011-05-01
Graphics Processing Units have become a booster for the microelectronics industry. However, due to intellectual property issues, there is a serious lack of information on implementation details of the hardware architecture that is behind GPUs. For instance, the way texture is handled and decompressed in a GPU to reduce bandwidth usage has never been dealt with in depth from a hardware point of view. This work addresses a comparative study on the hardware implementation of different texture decompression algorithms for both conventional (PCs and video game consoles) and mobile platforms. Circuit synthesis is performed targeting both a reconfigurable hardware platform and a 90nm standard cell library. Area-delay trade-offs have been extensively analyzed, which allows us to compare the complexity of decompressors and thus determine suitability of algorithms for systems with limited hardware resources.
NASA Technical Reports Server (NTRS)
Cottingham, Christine; Dwivedi, Vivek H.; Peters, Carlton; Powers, Daniel; Yang, Kan
2012-01-01
The Global Precipitation Measurement mission is a joint NASA/JAXA mission scheduled for launch in late 2013. The integration of thermal hardware onto the satellite began in the Fall of 2010 and will continue through the Summer of 2012. The thermal hardware on the mission included several constant conductance heat pipes, heaters, thermostats, thermocouples, radiator coatings, and blankets. During integration several problems arose and insights were gained that would help future satellite integrations. Also, lessons learned from previous missions were implemented with varying degrees of success. These insights can be arranged into three categories: (1) the specification of flight hardware using analysis results and the available mechanical resources; (2) the integration of thermal flight hardware onto the spacecraft; and (3) the preparation and implementation of testing the thermal flight hardware via touch tests, resistance measurements, and thermal vacuum testing.
Lok, U-Wai; Li, Pai-Chi
2016-03-01
Graphics processing unit (GPU)-based software beamforming has advantages over hardware-based beamforming of easier programmability and a faster design cycle, since complicated imaging algorithms can be efficiently programmed and modified. However, the need for a high data rate when transferring ultrasound radio-frequency (RF) data from the hardware front end to the software back end limits the real-time performance. Data compression methods can be applied to the hardware front end to mitigate the data transfer issue. Nevertheless, most decompression processes cannot be performed efficiently on a GPU, thus becoming another bottleneck of the real-time imaging. Moreover, lossless (or nearly lossless) compression is desirable to avoid image quality degradation. In a previous study, we proposed a real-time lossless compression-decompression algorithm and demonstrated that it can reduce the overall processing time because the reduction in data transfer time is greater than the computation time required for compression/decompression. This paper analyzes the lossless compression method in order to understand the factors limiting the compression efficiency. Based on the analytical results, a nearly lossless compression is proposed to further enhance the compression efficiency. The proposed method comprises a transformation coding method involving modified lossless compression that aims at suppressing amplitude data. The simulation results indicate that the compression ratio (CR) of the proposed approach can be enhanced from nearly 1.8 to 2.5, thus allowing a higher data acquisition rate at the front end. The spatial and contrast resolutions with and without compression were almost identical, and the process of decompressing the data of a single frame on a GPU took only several milliseconds. Moreover, the proposed method has been implemented in a 64-channel system that we built in-house to demonstrate the feasibility of the proposed algorithm in a real system. It was found that channel data from a 64-channel system can be transferred using the standard USB 3.0 interface in most practical imaging applications.
Algorithm Engineering: Concepts and Practice
NASA Astrophysics Data System (ADS)
Chimani, Markus; Klein, Karsten
Over the last years the term algorithm engineering has become a widespread synonym for experimental evaluation in the context of algorithm development. Yet it implies even more. We discuss the major weaknesses of traditional "pen and paper" algorithmics and the ever-growing gap between theory and practice in the context of modern computer hardware and real-world problem instances. We present the key ideas and concepts of the central algorithm engineering cycle that is based on a full feedback loop: It starts with the design of the algorithm, followed by the analysis, implementation, and experimental evaluation. The results of the latter can then be reused for modifications to the algorithmic design, stronger or input-specific theoretic performance guarantees, etc. We describe the individual steps of the cycle, explaining the rationale behind them and giving examples of how to conduct these steps thoughtfully. Thereby we give an introduction to current algorithmic key issues like I/O-efficient or parallel algorithms, succinct data structures, hardware-aware implementations, and others. We conclude with two especially insightful success stories—shortest path problems and text search—where the application of algorithm engineering techniques led to tremendous performance improvements compared with previous state-of-the-art approaches.
Co-design of software and hardware to implement remote sensing algorithms
NASA Astrophysics Data System (ADS)
Theiler, James P.; Frigo, Janette R.; Gokhale, Maya; Szymanski, John J.
2002-01-01
Both for offline searches through large data archives and for onboard computation at the sensor head, there is a growing need for ever-more rapid processing of remote sensing data. For many algorithms of use in remote sensing, the bulk of the processing takes place in an ``inner loop'' with a large number of simple operations. For these algorithms, dramatic speedups can often be obtained with specialized hardware. The difficulty and expense of digital design continues to limit applicability of this approach, but the development of new design tools is making this approach more feasible, and some notable successes have been reported. On the other hand, it is often the case that processing can also be accelerated by adopting a more sophisticated algorithm design. Unfortunately, a more sophisticated algorithm is much harder to implement in hardware, so these approaches are often at odds with each other. With careful planning, however, it is sometimes possible to combine software and hardware design in such a way that each complements the other, and the final implementation achieves speedup that would not have been possible with a hardware-only or a software-only solution. We will in particular discuss the co-design of software and hardware to achieve substantial speedup of algorithms for multispectral image segmentation and for endmember identification.
Real-time skin feature identification in a time-sequential video stream
NASA Astrophysics Data System (ADS)
Kramberger, Iztok
2005-04-01
Skin color can be an important feature when tracking skin-colored objects. Particularly this is the case for computer-vision-based human-computer interfaces (HCI). Humans have a highly developed feeling of space and, therefore, it is reasonable to support this within intelligent HCI, where the importance of augmented reality can be foreseen. Joining human-like interaction techniques within multimodal HCI could, or will, become a feature of modern mobile telecommunication devices. On the other hand, real-time processing plays an important role in achieving more natural and physically intuitive ways of human-machine interaction. The main scope of this work is the development of a stereoscopic computer-vision hardware-accelerated framework for real-time skin feature identification in the sense of a single-pass image segmentation process. The hardware-accelerated preprocessing stage is presented with the purpose of color and spatial filtering, where the skin color model within the hue-saturation-value (HSV) color space is given with a polyhedron of threshold values representing the basis of the filter model. An adaptive filter management unit is suggested to achieve better segmentation results. This enables the adoption of filter parameters to the current scene conditions in an adaptive way. Implementation of the suggested hardware structure is given at the level of field programmable system level integrated circuit (FPSLIC) devices using an embedded microcontroller as their main feature. A stereoscopic clue is achieved using a time-sequential video stream, but this shows no difference for real-time processing requirements in terms of hardware complexity. The experimental results for the hardware-accelerated preprocessing stage are given by efficiency estimation of the presented hardware structure using a simple motion-detection algorithm based on a binary function.
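A software sketch of the color-filtering stage may help: pixels are converted to HSV and tested against fixed bounds. The axis-aligned threshold box below is a crude stand-in for the paper's polyhedron of threshold values, and the bounds are illustrative only:

```python
import numpy as np

def skin_mask(rgb: np.ndarray) -> np.ndarray:
    """Classify pixels as skin by thresholding in HSV space; the box of
    bounds below is illustrative, not the published polyhedron model."""
    rgb = rgb.astype(np.float32) / 255.0
    mx, mn = rgb.max(axis=-1), rgb.min(axis=-1)
    v = mx
    s = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-6), 0.0)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Hue in degrees, computed piecewise from the dominant channel.
    c = np.maximum(mx - mn, 1e-6)
    h = np.select(
        [mx == r, mx == g],
        [(60 * (g - b) / c) % 360, 60 * (b - r) / c + 120],
        60 * (r - g) / c + 240,
    )
    return (h < 50) & (s > 0.15) & (s < 0.75) & (v > 0.35)
```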
Real-time orthorectification by FPGA-based hardware acceleration
NASA Astrophysics Data System (ADS)
Kuo, David; Gordon, Don
2010-10-01
Orthorectification, which corrects the perspective distortion of remote sensing imagery and provides accurate geolocation and ease of correlation to other images, is a valuable first step in image processing for information extraction. However, the large amount of metadata and the floating-point matrix transformations required to operate on each pixel make this a computation- and I/O (input/output)-intensive process. As a result, much imagery is either left unprocessed or loses time-sensitive value in the long processing cycle. However, the computation on each pixel can be reduced substantially by reusing computational results from neighboring pixels, and accelerated by one to two orders of magnitude with a special pipelined hardware architecture. A specialized coprocessor that is implemented inside an FPGA (Field Programmable Gate Array) chip and surrounded by vendor-supported hardware IP (Intellectual Property) shares the computation workload with the CPU through a PCI-Express interface. The ultimate speed of one pixel per clock (125 MHz) is achieved by the pipelined systolic array architecture. The optimal partition between software and hardware, the timing profile among image I/O and computation, and the highly automated GUI (Graphical User Interface) that fully exploits this speed increase to maximize overall image production throughput will also be discussed. The software that runs on a workstation with the acceleration hardware orthorectifies 16 megapixels per second, which is 16 times faster than without the hardware. It turns the production time from months to days. A real-life success story of an imaging satellite company that adopted such workstations for their orthorectified imagery production will be presented. The image processing computations that could be accelerated more efficiently by the same approach will also be analyzed.
Software design and implementation concepts for an interoperable medical communication framework.
Besting, Andreas; Bürger, Sebastian; Kasparick, Martin; Strathen, Benjamin; Portheine, Frank
2018-02-23
The new IEEE 11073 service-oriented device connectivity (SDC) standard proposals for networked point-of-care and surgical devices constitute the basis for improved interoperability due to their independence of vendors. To accelerate the distribution of the standard, a reference implementation is indispensable. However, the implementation of such a framework has to overcome several non-trivial challenges. First, the high level of complexity of the underlying standard must be reflected in the software design. An efficient implementation has to consider the limited resources of the underlying hardware. Moreover, the framework's purpose of realizing a distributed system demands a high degree of reliability of the framework itself and its internal mechanisms. Additionally, a framework must provide an easy-to-use and fail-safe application programming interface (API). In this work, we address these challenges by discussing suitable software engineering principles and practical coding guidelines. A descriptive model is developed that identifies key strategies. General feasibility is shown by outlining environments in which our implementation has been utilized.
Energy efficient engine low-pressure compressor component test hardware detailed design report
NASA Technical Reports Server (NTRS)
Michael, C. J.; Halle, J. E.
1981-01-01
The aerodynamic and mechanical design of the low-pressure compressor component of the Energy Efficient Engine is described. The component was designed to meet the requirements of the Flight Propulsion System while maintaining a low-cost approach in providing a low-pressure compressor design for the Integrated Core/Low Spool test required in the Energy Efficient Engine Program. The resulting low-pressure compressor component design meets or exceeds all design goals with the exception of surge margin. In addition, the expense of hardware fabrication for the Integrated Core/Low Spool test has been minimized through the use of existing minor part hardware.
Design and Implementation of Embedded Computer Vision Systems Based on Particle Filters
2010-01-01
A methodology is presented for hardware/software implementation of multi-dimensional particle filter applications and is explored in the third application, a 3-D case study. Multiprocessor implementation of particle filters is an important option to examine, and a significant body of work exists on optimizing generic implementations.
The use of UNIX in a real-time environment
NASA Technical Reports Server (NTRS)
Luken, R. D.; Simons, P. C.
1986-01-01
This paper describes a project to evaluate the feasibility of using commercial off-the-shelf hardware and the UNIX operating system to implement a real-time control and monitor system. A functional subset of the Checkout, Control and Monitor System was chosen as the test bed for the project. The project consists of three separate architecture implementations: a local area bus network, a star network, and a central host. The motivation for this project stemmed from the need to find a way to implement real-time systems without the cost burden of developing and maintaining custom hardware and unique software. This has always been accepted as the only option because of the need to optimize the implementation for performance. However, with the cost/performance of today's hardware, the inefficiencies of high-level languages and portable operating systems can be effectively overcome.
Input/output models for general aviation piston-prop aircraft fuel economy
NASA Technical Reports Server (NTRS)
Sweet, L. M.
1982-01-01
A fuel-efficient cruise performance model for general aviation piston-engine airplanes was tested. Equations were developed for (1) the standard atmosphere; (2) airframe-propeller-atmosphere cruise performance; and (3) naturally aspirated engine cruise performance. The compact cruise performance model employs corrected quantities, corrected performance plots, and algebraic equations; it maximizes range R with or without constraints and appears suitable for airborne microprocessor implementation. The following hardware is recommended: an ignition timing regulator, a fuel-air mass ratio controller, a microprocessor, and sensors and displays.
The evolution of Orbiter depot support, with applications to future space vehicles
NASA Technical Reports Server (NTRS)
Mcclain, Michael L.
1990-01-01
The reasons for depot consolidation and the processes established to implement the Orbiter depot are presented. The Space Shuttle Orbiter depot support is presently being consolidated due to equipment suppliers leaving the program, escalating depot support costs, and increasing repair turnaround times. Details of the depot support program for orbiter hardware and selected pieces of support equipment are discussed. The benefits gained from this consolidation and the lessons learned are then applied to future reusable space vehicles to provide program managers a forward look at the need for efficient depot support.
Development of a space-systems network testbed
NASA Technical Reports Server (NTRS)
Lala, Jaynarayan; Alger, Linda; Adams, Stuart; Burkhardt, Laura; Nagle, Gail; Murray, Nicholas
1988-01-01
This paper describes a communications network testbed which has been designed to allow the development of architectures and algorithms that meet the functional requirements of future NASA communication systems. The central hardware components of the Network Testbed are programmable circuit switching communication nodes which can be adapted by software or firmware changes to customize the testbed to particular architectures and algorithms. Fault detection, isolation, and reconfiguration have been implemented in the Network with a hybrid approach which utilizes features of both centralized and distributed techniques to provide efficient handling of faults within the Network.
Self-navigation of a scanning tunneling microscope tip toward a micron-sized graphene sample.
Li, Guohong; Luican, Adina; Andrei, Eva Y
2011-07-01
We demonstrate a simple capacitance-based method to quickly and efficiently locate micron-sized conductive samples, such as graphene flakes, on insulating substrates in a scanning tunneling microscope (STM). By using edge recognition, the method is designed to locate and to identify small features when the STM tip is far above the surface, allowing for crash-free search and navigation. The method can be implemented in any STM environment, even at low temperatures and in strong magnetic field, with minimal or no hardware modifications.
Towards a flexible middleware for context-aware pervasive and wearable systems.
Muro, Marco; Amoretti, Michele; Zanichelli, Francesco; Conte, Gianni
2012-11-01
Ambient intelligence and wearable computing call for innovative hardware and software technologies, including a highly capable, flexible and efficient middleware, allowing for the reuse of existing pervasive applications when developing new ones. In the considered application domain, middleware should also support self-management, interoperability among different platforms, efficient communications, and context awareness. In the ongoing "everything is networked" scenario, scalability appears as a very important issue, for which the peer-to-peer (P2P) paradigm emerges as an appealing solution for connecting software components in an overlay network, allowing for efficient and balanced data distribution mechanisms. In this paper, we illustrate how all these concepts can be placed into a theoretical tool, called the networked autonomic machine (NAM), implemented into a NAM-based middleware, and evaluated against practical problems of pervasive computing.
NASA Astrophysics Data System (ADS)
Zhang, M.; Zheng, G. Z.; Zheng, W.; Chen, Z.; Yuan, T.; Yang, C.
2016-04-01
Magnetic confinement nuclear fusion experiments require various real-time control applications, such as plasma control. ITER has designed the Fast Plant System Controller (FPSC) for this purpose and has provided hardware and software standards and guidelines for building an FPSC. In order to develop various real-time FPSC applications efficiently, a flexible real-time software framework called the J-TEXT real-time framework (JRTF) has been developed by the J-TEXT tokamak team. JRTF allows developers to implement different functions as independent and reusable modules called Application Blocks (ABs). AB developers only need to focus on implementing the control tasks or the algorithms; timing, scheduling, data sharing, and eventing are handled by the JRTF pipelines. JRTF provides great flexibility in developing ABs: unit tests against ABs can be developed easily, and ABs can even be used in non-JRTF applications. JRTF also provides interfaces allowing JRTF applications to be configured and monitored at runtime. JRTF is compatible with ITER standard FPSC hardware and the ITER CODAC (Control, Data Access and Communication) Core software. It can be configured and monitored using EPICS (Experimental Physics and Industrial Control System). Moreover, JRTF can be ported to different platforms and integrated with supervisory control software other than EPICS. The paper presents the design and implementation of JRTF as well as brief test results.
NASA Technical Reports Server (NTRS)
Feher, Kamilo
1993-01-01
The performance and implementation complexity of coherent and of noncoherent QPSK and GMSK modulation/demodulation techniques in a complex mobile satellite systems environment, including large Doppler shift, delay spread, and low C/I, are compared. We demonstrate that for large f(sub d)T(sub b) products, where f(sub d) is the Doppler shift and T(sub b) is the bit duration, noncoherent (discriminator detector or differential demodulation) systems have a lower BER floor than their coherent counterparts. For significant delay spreads, e.g., tau(sub rms) greater than 0.4 T(sub b), and low C/I, coherent systems outperform noncoherent systems. However, the synchronization time of coherent systems is longer than that of noncoherent systems. Spectral efficiency, overall capacity, and related hardware complexity issues of these systems are also analyzed. We demonstrate that coherent systems have a simpler overall architecture (IF filter implementation cost versus carrier recovery) and are more robust in an RF frequency drift environment. Additionally, the prediction tools, computer simulations, and analysis of coherent systems are simpler. The threshold or capture effect in a low-C/I interference environment is critical for noncoherent discriminator-based systems. We conclude with a comparison of hardware architectures of coherent and of noncoherent systems, including recent trends in commercial VLSI technology and direct baseband-to-RF transmit and RF-to-baseband (0-IF) receiver implementation strategies.
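A small simulation makes the Doppler argument tangible: differential detection recovers the phase change between consecutive symbols, so a frequency offset common to both samples largely cancels. This toy model (symbol-rate sampling, no noise or delay spread; the offset f(sub d)T(sub b) = 0.01 is an illustrative value) is a sketch, not the authors' analysis:

```python
import numpy as np

def dqpsk_demod(z):
    """Noncoherent differential detection: the product of consecutive
    samples exposes the symbol phase change, cancelling a common
    carrier-phase/Doppler offset."""
    d = z[1:] * np.conj(z[:-1])
    return (np.round(np.angle(d) / (np.pi / 2)) % 4).astype(int)

# Differentially encoded QPSK with a Doppler-like frequency offset.
rng = np.random.default_rng(3)
sym = rng.integers(0, 4, 1000)
phase = np.cumsum(sym) * np.pi / 2            # differential encoding
n = np.arange(1000)
z = np.exp(1j * (phase + 2 * np.pi * 0.01 * n))  # f_d * T_b = 0.01
print((dqpsk_demod(z) == sym[1:]).mean())        # -> 1.0 for small f_d*T_b
```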
All-optical reservoir computer based on saturation of absorption.
Dejonckheere, Antoine; Duport, François; Smerieri, Anteo; Fang, Li; Oudar, Jean-Louis; Haelterman, Marc; Massar, Serge
2014-05-05
Reservoir computing is a new bio-inspired computation paradigm. It exploits a dynamical system driven by a time-dependent input to carry out computation. For efficient information processing, only a few parameters of the reservoir need to be tuned, which makes it a promising framework for hardware implementation. Recently, electronic, opto-electronic and all-optical experimental reservoir computers were reported. In those implementations, the nonlinear response of the reservoir is provided by active devices such as optoelectronic modulators or optical amplifiers. By contrast, we propose here the first reservoir computer based on a fully passive nonlinearity, namely the saturable absorption of a semiconductor mirror. Our experimental setup constitutes an important step towards the development of ultrafast low-consumption analog computers.
Optimized design of embedded DSP system hardware supporting complex algorithms
NASA Astrophysics Data System (ADS)
Li, Yanhua; Wang, Xiangjun; Zhou, Xinling
2003-09-01
The paper presents an optimized design method for a flexible and economical embedded DSP system that can implement complex processing algorithms such as biometric recognition and real-time image processing. It consists of a floating-point DSP, 512 Kbytes of data RAM, 1 Mbyte of FLASH program memory, a CPLD for achieving flexible logic control of the input channel, and an RS-485 transceiver for local network communication. Because the design employs a DSP with a high performance-price ratio, the TMS320C6712, and a large FLASH, this system permits loading and performing complex algorithms with little algorithm optimization and code reduction. The CPLD provides flexible logic control for the whole DSP board, especially in the input channel, and allows convenient interfacing between different sensors and the DSP system. The transceiver circuit can transfer data between the DSP and a host computer. In the paper, some key technologies are also introduced which make the whole system work efficiently. Because of the characteristics referred to above, the hardware is a perfect platform for multi-channel data collection, image processing, and other signal processing with high performance and adaptability. The application section of this paper presents how this hardware is adapted for a biometric identification system with high identification precision. The result reveals that this hardware is easy to interface with a CMOS imager and is capable of carrying out complex biometric identification algorithms, which require real-time processing.
The Art of Space Flight Exercise Hardware: Design and Implementation
NASA Technical Reports Server (NTRS)
Beyene, Nahom M.
2004-01-01
The design of space flight exercise hardware depends on experience with crew health maintenance in a microgravity environment, history in development of flight-quality exercise hardware, and a foundation for certifying proper project management and design methodology. Developed over the past 40 years, the expertise in designing exercise countermeasures hardware at the Johnson Space Center stems from these three aspects of design. The medical community has steadily pursued an understanding of physiological changes in humans in a weightless environment and methods of counteracting negative effects on the cardiovascular and musculoskeletal system. The effects of weightlessness extend to the pulmonary and neurovestibular system as well with conditions ranging from motion sickness to loss of bone density. Results have shown losses in water weight and muscle mass in antigravity muscle groups. With the support of university-based research groups and partner space agencies, NASA has identified exercise to be the primary countermeasure for long-duration space flight. The history of exercise hardware began during the Apollo Era and leads directly to the present hardware on the International Space Station. Under the classifications of aerobic and resistive exercise, there is a clear line of development from the early devices to the countermeasures hardware used today. In support of all engineering projects, the engineering directorate has created a structured framework for project management. Engineers have identified standards and "best practices" to promote efficient and elegant design of space exercise hardware. The quality of space exercise hardware depends on how well hardware requirements are justified by exercise performance guidelines and crew health indicators. When considering the microgravity environment of the device, designers must consider performance of hardware separately from the combined human-in-hardware system. Astronauts are the caretakers of the hardware while it is deployed and conduct all sanitization, calibration, and maintenance for the devices. Thus, hardware designs must account for these issues with a goal of minimizing crew time on orbit required to complete these tasks. In the future, humans will venture to Mars and exercise countermeasures will play a critical role in allowing us to continue in our spirit of exploration. NASA will benefit from further experimentation on Earth, through the International Space Station, and with advanced biomechanical models to quantify how each device counteracts specific symptoms of weightlessness. With the continued support of international space agencies and the academic research community, we will usher the next frontier in human space exploration.
Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Bolosky, William Joseph
1993-01-01
Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. It finds that in properly built systems, software maintained coherence can perform comparably to or even better than hardware maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.
SAD-Based Stereo Vision Machine on a System-on-Programmable-Chip (SoPC)
Zhang, Xiang; Chen, Zhangwei
2013-01-01
This paper proposes a novel solution for a stereo vision machine based on the System-on-Programmable-Chip (SoPC) architecture. The SoPC technology provides great convenience for accessing many hardware devices, such as DDRII, SSRAM, and Flash, by IP reuse. The system hardware is implemented in a single FPGA chip involving a 32-bit Nios II microprocessor, which is a configurable soft IP core in charge of managing the image buffer and users' configuration data. The Sum of Absolute Differences (SAD) algorithm is used for dense disparity map computation. The circuits of the algorithmic module are modeled by the Matlab-based DSP Builder. With a set of configuration interfaces, the machine can process many different sizes of stereo pair images; the maximum image size is up to 512 K pixels. This machine is designed to focus on real-time stereo vision applications. The stereo vision machine offers good performance and high efficiency in real time. With a hardware FPGA clock of 90 MHz, 23 frames of 640 × 480 disparity maps can be obtained per second with a 5 × 5 matching window and a maximum disparity of 64 pixels. PMID:23459385
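A plain software reference for the SAD matching cost (not the pipelined HDL design) can be written with a summed-area table so each window sum costs four additions; the window size and disparity range below mirror the 5 × 5 window and 64-pixel range mentioned above:

```python
import numpy as np

def sad_disparity(left, right, max_disp=64, win=5):
    """Dense disparity by SAD block matching over grayscale images;
    a NumPy reference of the cost computed by the FPGA pipeline."""
    r = win // 2
    pad = lambda im: np.pad(im.astype(np.int64), r, mode="edge")
    L = pad(left)
    costs = []
    for d in range(max_disp):
        R = pad(np.roll(right, d, axis=1))  # candidate disparity d
        ad = np.abs(L - R)
        # SAD over every win x win window via a summed-area table.
        c = np.zeros((ad.shape[0] + 1, ad.shape[1] + 1), dtype=np.int64)
        c[1:, 1:] = ad.cumsum(0).cumsum(1)
        costs.append(c[win:, win:] - c[:-win, win:]
                     - c[win:, :-win] + c[:-win, :-win])
    return np.argmin(np.stack(costs), axis=0).astype(np.uint8)
```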
NASA Technical Reports Server (NTRS)
Tobagi, Fouad A.; Dalgic, Ismail; Pang, Joseph
1990-01-01
The design and implementation of interface units for high-speed Fiber Optic Local Area Networks and Broadband Integrated Services Digital Networks are discussed. In recent years, a number of network adapters designed to support high-speed communications have emerged. The approach taken to the design of a high-speed network interface unit was to implement packet processing functions in hardware, using VLSI technology. The VLSI hardware implementation of a buffer management unit, which is required in such architectures, is described.
Multivariate Cryptography Based on Clipped Hopfield Neural Network.
Wang, Jia; Cheng, Lee-Ming; Su, Tong
2018-02-01
Designing secure and efficient multivariate public key cryptosystems [multivariate cryptography (MVC)] to strengthen the security of RSA and ECC in conventional and quantum computational environments continues to be a challenging research topic. In this paper, we describe multivariate public key cryptosystems based on an extended Clipped Hopfield Neural Network (CHNN) and implement them using the MVC (CHNN-MVC) framework operated in space. The Diffie-Hellman key exchange algorithm is extended into the matrix field, which illustrates the feasibility of its new applications in both classic and postquantum cryptography. The efficiency and security of the proposed public key cryptosystem CHNN-MVC are evaluated by simulation, and its underlying problem is shown to be NP-hard. The proposed algorithm strengthens multivariate public key cryptosystems and lends itself to practical hardware realization.
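To illustrate what extending Diffie-Hellman into a matrix setting can look like, the toy below runs the exchange in a group of matrix powers modulo a prime. It is only a didactic sketch under assumed parameters; it is not the CHNN-MVC construction and carries no security claims:

```python
import numpy as np

P = 2_147_483_647  # a Mersenne prime modulus (illustrative)

def mat_pow(G, e, p=P):
    """Square-and-multiply exponentiation of a matrix modulo p,
    using Python big integers to avoid overflow."""
    R = np.eye(G.shape[0], dtype=object)
    G = G.astype(object) % p
    while e:
        if e & 1:
            R = R.dot(G) % p
        G = G.dot(G) % p
        e >>= 1
    return R

rng = np.random.default_rng(7)
G = rng.integers(0, P, size=(4, 4))   # public base matrix
a, b = 123457, 654323                 # private exponents
A, B = mat_pow(G, a), mat_pow(G, b)   # exchanged public values
# Powers of the same matrix commute, so both sides derive the same key.
assert np.array_equal(mat_pow(B, a), mat_pow(A, b))
```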
On the impact of approximate computation in an analog DeSTIN architecture.
Young, Steven; Lu, Junjie; Holleman, Jeremy; Arel, Itamar
2014-05-01
Deep machine learning (DML) holds the potential to revolutionize machine learning by automating rich feature extraction, which has become the primary bottleneck of human engineering in pattern recognition systems. However, the heavy computational burden renders DML systems implemented on conventional digital processors impractical for large-scale problems. The highly parallel computations required to implement large-scale deep learning systems are well suited to custom hardware. Analog computation has demonstrated power efficiency advantages of multiple orders of magnitude relative to digital systems while performing nonideal computations. In this paper, we investigate typical error sources introduced by analog computational elements and their impact on system-level performance in DeSTIN--a compositional deep learning architecture. These inaccuracies are evaluated on a pattern classification benchmark, clearly demonstrating the robustness of the underlying algorithm to the errors introduced by analog computational elements. A clear understanding of the impacts of nonideal computations is necessary to fully exploit the efficiency of analog circuits.
NASA Technical Reports Server (NTRS)
Spencer, James E., Jr.; Looney, Joe
1994-01-01
In this paper, the prime objective is to describe a custom 4-dof (degree-of-freedom) robotic arm capable of autonomously or telerobotically performing systematic HEPA filter inspection and certification in the Shuttle Launch Pad Payload Changeout Rooms (PCRs) on pads A and B at the Kennedy Space Center, Florida. This HEPA filter inspection robot (HFIR) has been designed to be easily deployable and is equipped with the necessary sensory devices, control hardware, software, and man-machine interfaces needed to implement HEPA filter inspection reliably and efficiently without damaging the filters or colliding with existing PCR structures or filters. The main purpose of the HFIR is to implement an automated positioning system to move special inspection sensors in pre-defined or manual patterns for the purpose of verifying filter integrity and efficiency. This will ultimately relieve NASA Payload Operations of significant problems associated with time, cost, and personnel safety, impacts realized during non-automated PCR HEPA filter certification.
Evaluation of hardware costs of implementing PSK signal detection circuit based on "system on chip"
NASA Astrophysics Data System (ADS)
Sokolovskiy, A. V.; Dmitriev, D. D.; Veisov, E. A.; Gladyshev, A. B.
2018-05-01
The article deals with the choice of the architecture of digital signal processing units for implementing a PSK signal detection scheme. The required number of shift registers and computational processes is used to assess the effectiveness of the architectures when implementing the detection scheme as a "system on a chip". A statistical estimation of the normalized code sequence offset in the signal synchronization scheme is used for various hardware block architectures.
Computer hardware description languages - A tutorial
NASA Technical Reports Server (NTRS)
Shiva, S. G.
1979-01-01
The paper introduces hardware description languages (HDL) as useful tools for hardware design and documentation. The capabilities and limitations of HDLs are discussed along with the guidelines needed in selecting an appropriate HDL. The directions for future work are provided and attention is given to the implementation of HDLs in microcomputers.
An efficient interpolation filter VLSI architecture for HEVC standard
NASA Astrophysics Data System (ADS)
Zhou, Wei; Zhou, Xin; Lian, Xiaocong; Liu, Zhenyu; Liu, Xiaoxiang
2015-12-01
The next-generation video coding standard, High-Efficiency Video Coding (HEVC), is especially efficient for coding high-resolution video such as 8K ultra-high-definition (UHD) video. Fractional motion estimation in HEVC presents a significant challenge in clock latency and area cost, as it consumes more than 40% of the total encoding time and thus results in high computational complexity. With the aim of supporting 8K-UHD video applications, an efficient interpolation filter VLSI architecture for HEVC is proposed in this paper. First, a new interpolation filter algorithm based on an 8-pixel interpolation unit is proposed; it saves 19.7% of processing time on average with acceptable coding quality degradation. Based on the proposed algorithm, an efficient interpolation filter VLSI architecture, composed of a reused interpolation data path, an efficient memory organization, and a reconfigurable pipeline interpolation filter engine, is presented to reduce the hardware area and achieve high throughput. The final VLSI implementation requires only 37.2k gates in a standard 90-nm CMOS technology at an operating frequency of 240 MHz. The proposed architecture can be reused for either half-pixel or quarter-pixel interpolation, which saves about 131,040 bits of RAM. The processing latency of the proposed VLSI architecture supports real-time processing of 4:2:0 format 7680 × 4320@78fps video sequences.
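For reference, the HEVC luma half-sample filter is an 8-tap FIR with coefficients (-1, 4, -11, 40, 40, -11, 4, -1) normalized by 64; the sketch below applies it horizontally to one row of samples. It is a behavioral reference, not the proposed VLSI data path:

```python
import numpy as np

# HEVC luma half-sample filter taps (sum = 64, so normalize by >> 6).
HPEL = np.array([-1, 4, -11, 40, 40, -11, 4, -1])

def hpel_interp_row(row: np.ndarray) -> np.ndarray:
    """Horizontal half-pel interpolation of one row of 8-bit luma
    samples with the HEVC 8-tap filter (edge-replicated borders)."""
    padded = np.pad(row.astype(np.int32), (3, 4), mode="edge")
    # Reverse the kernel so np.convolve computes a correlation.
    out = np.convolve(padded, HPEL[::-1], mode="valid")
    return np.clip((out + 32) >> 6, 0, 255).astype(np.uint8)
```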
FPGA-Based, Self-Checking, Fault-Tolerant Computers
NASA Technical Reports Server (NTRS)
Some, Raphael; Rennels, David
2004-01-01
A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant-computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache and local-memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
Lemoine, E; Merceron, D; Sallantin, J; Nguifo, E M
1999-01-01
This paper describes a new approach to problem solving by splitting up problem component parts between software and hardware. Our main idea arises from the combination of two previously published works. The first proposed a conceptual environment for concept modelling in which the machine and the human expert interact. The second reported an algorithm based on a reconfigurable hardware system that outperforms any previously published genetic database scanning hardware or algorithms. Here we show how efficient the interaction between the machine and the expert is when the concept modelling is based on a reconfigurable hardware system. Their cooperation is thus achieved at a real-time interaction speed. The designed system has been partially applied to the recognition of primate splice junction sites in genetic sequences.
Digital echocardiography 2002: now is the time
NASA Technical Reports Server (NTRS)
Thomas, James D.; Greenberg, Neil L.; Garcia, Mario J.
2002-01-01
The ability to acquire echocardiographic images digitally, store and transfer these data using the DICOM standard, and routinely analyze examinations exists today and allows the implementation of a digital echocardiography laboratory. The purpose of this review article is to outline the critical components of a digital echocardiography laboratory, discuss general strategies for implementation, and put forth some of the pitfalls that we have encountered in our own implementation. The major components of the digital laboratory include (1) digital echocardiography machines with network output, (2) a switched high-speed network, (3) a high throughput server with abundant local storage, (4) a reliable low-cost archive, (5) software to manage information, and (6) support mechanisms for software and hardware. Implementation strategies can vary from a complete vendor solution providing all components (hardware, software, support), to a strategy similar to our own where standard computer and networking hardware are used with specialized software for management of image and measurement information.
Chao, Chun-Tang; Maneetien, Nopadon; Wang, Chi-Jo; Chiou, Juing-Shian
2014-01-01
This paper presents the design and evaluation of the hardware circuit for electronic stethoscopes with heart sound cancellation capabilities using field programmable gate arrays (FPGAs). The adaptive line enhancer (ALE) was adopted as the filtering methodology to reduce heart sound attributes from the breath sounds obtained via the electronic stethoscope pickup. FPGAs were utilized to implement the ALE functions in hardware to achieve near real-time breath sound processing. We believe that such an implementation is unprecedented and crucial toward a truly useful, standalone medical device in outpatient clinic settings. The implementation evaluation with one Altera Cyclone II-EP2C70F89 shows that the proposed ALE uses 45% of the chip's resources. Experiments with the proposed prototype were made using a DE2-70 emulation board with recorded body signals obtained from online medical archives. Clear suppressions were observed in our experiments from both the frequency-domain and time-domain perspectives.
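A floating-point reference of the ALE structure (not the fixed-point FPGA circuit) may clarify the method: an LMS filter predicts the current sample from a delayed copy of the input, so the quasi-periodic heart sound is captured by the predictor and the residual retains the broadband breath sound. Filter order, delay, and step size below are illustrative:

```python
import numpy as np

def ale(x, order=32, delay=16, mu=1e-3):
    """Adaptive line enhancer: an LMS predictor driven by a delayed
    copy of the input; the residual e suppresses the periodic part."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for n in range(delay + order, len(x)):
        ref = x[n - delay - order:n - delay][::-1]  # delayed tap vector
        y = w @ ref                 # prediction of the periodic component
        e[n] = x[n] - y             # breath-sound estimate
        w += 2 * mu * e[n] * ref    # LMS weight update
    return e

# Demo: a sine "heart" tone buried in white "breath" noise (illustrative).
rng = np.random.default_rng(0)
t = np.arange(20000) / 8000.0
x = np.sin(2 * np.pi * 40 * t) + 0.3 * rng.standard_normal(t.size)
breath = ale(x)  # residual with the periodic component suppressed
```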
The design and implementation of postprocessing for depth map on real-time extraction system.
Tang, Zhiwei; Li, Bin; Li, Huosheng; Xu, Zheng
2014-01-01
Depth estimation is a key technology for stereo vision communications. A real-time depth map can be obtained with hardware, but hardware cannot implement algorithms as complicated as software can because of restrictions in the hardware structure. Consequently, some incorrect stereo matches will inevitably exist in the process of depth estimation by hardware such as an FPGA. In order to solve this problem, a postprocessing function is designed in this paper. After a matching cost uniqueness test, both left-right and right-left consistency check solutions are implemented; then, the holes in the depth maps are filled with correct depth values on the basis of the right-left consistency check solution. The experimental results show that depth map extraction and the postprocessing function can be implemented in real time in the same system; what is more, the quality of the depth maps is satisfactory.
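A software sketch of the two postprocessing steps (a simple analogue, not the FPGA implementation) is given below: the consistency check invalidates pixels whose left and right disparity estimates disagree, and the fill step propagates the nearest valid disparity. The tolerance and fill rule are illustrative:

```python
import numpy as np

def lr_check(disp_l, disp_r, tol=1):
    """Cross-check left and right disparity maps; pixels whose two
    estimates disagree by more than tol are marked invalid (-1)."""
    h, w = disp_l.shape
    cols = np.arange(w)
    out = disp_l.astype(np.int32).copy()
    for y in range(h):
        x_r = np.clip(cols - disp_l[y], 0, w - 1)  # matching pixel in right map
        bad = np.abs(disp_l[y] - disp_r[y, x_r]) > tol
        out[y, bad] = -1
    return out

def fill_holes(disp):
    """Fill invalidated pixels with the nearest valid disparity to the
    left (a simple background fill after the consistency check)."""
    out = disp.copy()
    for y in range(out.shape[0]):
        last = 0
        for x in range(out.shape[1]):
            if out[y, x] < 0:
                out[y, x] = last
            else:
                last = out[y, x]
    return out
```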
Floating-Point Units and Algorithms for field-programmable gate arrays
DOE Office of Scientific and Technical Information (OSTI.GOV)
Underwood, Keith D.; Hemmert, K. Scott
2005-11-01
The software that we are attempting to copyright is a package of floating-point unit descriptions and example algorithm implementations using those units for use in FPGAs. The floating-point units are best-in-class implementations of add, multiply, divide, and square-root floating-point operations. The algorithm implementations are sample (not highly flexible) implementations of FFT, matrix multiply, matrix-vector multiply, and dot product. Together, one could think of the collection as an implementation of parts of the BLAS library or something similar to the FFTW packages (without the flexibility) for FPGAs. Results from this work have been published multiple times, and we are working on a publication to discuss the techniques we use to implement the floating-point units. For some more background, FPGAs are programmable hardware. "Programs" for this hardware are typically created using a hardware description language (examples include Verilog, VHDL, and JHDL). Our floating-point unit descriptions are written in JHDL, which allows them to include placement constraints that make them highly optimized relative to some other implementations of floating-point units. Many vendors (Nallatech from the UK, SRC Computers in the US) have similar implementations, but our implementations seem to be somewhat higher performance. Our algorithm implementations are written in VHDL, and models of the floating-point units are provided in VHDL as well. FPGA "programs" make multiple "calls" (hardware instantiations) to libraries of intellectual property (IP), such as the floating-point unit library described here. These programs are then compiled using a tool called a synthesizer (such as a tool from Synplicity, Inc.). The compiled file is a netlist of gates and flip-flops. This netlist is then mapped to a particular type of FPGA by a mapper and then a place-and-route tool. These tools assign the gates in the netlist to specific locations on the specific type of FPGA chip used and construct the required routes between them. The result is a "bitstream" that is analogous to a compiled binary. The bitstream is loaded into the FPGA to create a specific hardware configuration.
NASA Astrophysics Data System (ADS)
Sutliff, T. J.; Otero, A. M.; Urban, D. L.
2002-01-01
The Physical Sciences Research Program of NASA has chartered a broad suite of peer-reviewed research investigating both fundamental combustion phenomena and applied combustion research topics. Fundamental research provides insights to develop accurate simulations of complex combustion processes and allows developers to improve the efficiency of combustion devices, to reduce the production of harmful emissions, and to reduce the incidence of accidental uncontrolled combustion (fires, explosions). The applied research benefits humans living and working in space through its fire safety program. The Combustion Science Discipline is implementing a structured flight research program utilizing the International Space Station (ISS) and two of its premier facilities, the Combustion Integrated Rack of the Fluids and Combustion Facility and the Microgravity Science Glovebox, to conduct this space-based research. This paper reviews the current vision of Combustion Science research planned for International Space Station implementation from 2003 through 2012. A variety of research efforts in droplets and sprays, solid-fuels combustion, and gaseous combustion have been independently selected and critiqued through a series of peer-review processes. During this period, while both the ISS carrier and its research facilities are under development, the Combustion Science Discipline has synergistically combined research efforts into sub-topical areas. To conduct this research aboard ISS in the most cost effective and resource efficient manner, the sub-topic research areas are implemented via a multi-user hardware approach. This paper also summarizes the multi-user hardware approach and recaps the progress made in developing these research hardware systems. A balanced program content has been developed to maximize the production of fundamental and applied combustion research results within the current budgetary and ISS operational resource constraints. Decisions on utilizing the Combustion Integrated Rack and the Microgravity Science Glovebox are made based on facility capabilities and research requirements. To maximize research potential, additional research objectives are specified as desires a priori during the research design phase. These expanded research goals, which are designed to be achievable even with late addition of operational resources, allow additional research of a known, peer-endorsed scope to be conducted at marginal cost. Additional operational resources such as upmass, crewtime, data downlink bandwidth, and stowage volume may be presented by the ISS planners late in the research mission planning process. The Combustion Discipline has put in place plans to be prepared to take full advantage of such opportunities.
NASA Technical Reports Server (NTRS)
Boesen, Michael Reibel; Madsen, Jan; Keymeulen, Didier
2011-01-01
This paper presents the current state of the autonomous dynamically self-organizing and self-healing electronic DNA (eDNA) hardware architecture (patent pending). In its current prototype state, the eDNA architecture is capable of responding to multiple injected faults by autonomously reconfiguring itself to accommodate the fault and keep the application running. This paper will also disclose advanced features currently available in the simulation model only. These features are future work and will soon be implemented in hardware. Finally we will describe step-by-step how an application is implemented on the eDNA architecture.
An embedded controller for a 7-degree of freedom prosthetic arm.
Tenore, Francesco; Armiger, Robert S; Vogelstein, R Jacob; Wenstrand, Douglas S; Harshbarger, Stuart D; Englehart, Kevin
2008-01-01
We present results from an embedded real-time hardware system capable of decoding surface myoelectric signals (sMES) to control a seven-degree-of-freedom upper limb prosthesis. This is one of the first hardware implementations of sMES decoding algorithms and the most advanced controller to date. We compare decoding results from the device to simulation results from a real-time PC-based operating system. Performance of both systems is shown to be similar, with decoding accuracy greater than 90% for the floating-point software simulation and 80% for the fixed-point hardware and software implementations.
Benchmarking Model Variants in Development of a Hardware-in-the-Loop Simulation System
NASA Technical Reports Server (NTRS)
Aretskin-Hariton, Eliot D.; Zinnecker, Alicia M.; Kratz, Jonathan L.; Culley, Dennis E.; Thomas, George L.
2016-01-01
Distributed engine control architecture presents a significant increase in complexity over traditional implementations when viewed from the perspective of system simulation and hardware design and test. Even if the overall function of the control scheme remains the same, the hardware implementation can have a significant effect on the overall system performance due to differences in the creation and flow of data between control elements. A Hardware-in-the-Loop (HIL) simulation system is under development at NASA Glenn Research Center that enables the exploration of these hardware-dependent issues. The system is based on, but not limited to, the Commercial Modular Aero-Propulsion System Simulation 40k (C-MAPSS40k). This paper describes the step-by-step conversion from the self-contained baseline model to the hardware-in-the-loop model, and the validation of each step. As the control model hardware fidelity was improved during HIL system development, benchmarking simulations were performed to verify that engine system performance characteristics remained the same. The results demonstrate the goal of the effort: the new HIL configurations have similar functionality and performance compared to the baseline C-MAPSS40k system.
Image segmentation based upon topological operators: real-time implementation case study
NASA Astrophysics Data System (ADS)
Mahmoudi, R.; Akil, M.
2009-02-01
In many image processing applications, thinning and crest restoring are of considerable interest. The recommended algorithms for these procedures are those able to act directly on grayscale images while preserving topology, but their heavy time consumption remains the major disadvantage in choosing them. In this paper we present an efficient hardware implementation, on a RISC processor, of two powerful thinning and crest restoring algorithms developed by our team. The proposed implementation improves execution time. A segmentation chain applied to medical imaging serves as a concrete example to illustrate the improvements brought by optimization techniques at both the algorithmic and architectural levels. In particular, use of the SSE instruction set of x86_32 processors (Pentium IV, 3.06 GHz) enables real-time performance: a rate of 33 images (512x512) per second is achieved.
A method for real-time implementation of HOG feature extraction
NASA Astrophysics Data System (ADS)
Luo, Hai-bo; Yu, Xin-rong; Liu, Hong-mei; Ding, Qing-hai
2011-08-01
Histogram of oriented gradient (HOG) is an efficient feature extraction scheme, and HOG descriptors are widely used in computer vision and image processing for biometrics, target tracking, automatic target detection (ATD), automatic target recognition (ATR), etc. However, the computation of HOG feature extraction is unsuitable for direct hardware implementation, since it includes complicated operations. In this paper, an optimal design method and theoretical framework for real-time HOG feature extraction on an FPGA are proposed. The main principle is as follows: first, a parallel gradient computing unit circuit based on a parallel pipeline structure was designed. Second, the calculation of the arctangent and square root operations was simplified. Finally, a histogram generator based on a parallel pipeline structure was designed to calculate the histogram of each sub-region. Experimental results showed that HOG extraction can be completed within one pixel period by these computing units.
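The cell-histogram core of HOG that the pipeline above parallelizes can be sketched in a few lines of Python (the 8-pixel cell and 9 bins are the common defaults, assumed here; block normalization is omitted):

```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """Gradients plus per-cell orientation histograms, the core of HOG."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]      # central differences
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)                      # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation
    h, w = img.shape
    hist = np.zeros((h // cell, w // cell, bins))
    bin_idx = (ang / (180 / bins)).astype(int) % bins
    for i in range(h // cell):
        for j in range(w // cell):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hist[i, j] = np.bincount(b, weights=m, minlength=bins)
    return hist
```

The arctangent and square root in this sketch are exactly the operations the paper simplifies for hardware.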
DRFM Cordic Processor and Sea Clutter Modeling for Enhancing Structured False Target Synthesis
2017-09-01
...to achieve accuracy at 5.625°. The resulting design was implemented using the Verilog hardware description language. The second investigation concerns generating sea clutter to impose on the false target...
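For reference, a behavioral model of the CORDIC recurrence that a DRFM phase processor of this kind typically implements (vectoring mode, computing a phase angle with shifts and adds only; the iteration count is an assumption) is:

```python
import math

def cordic_arctan(x, y, iterations=16):
    """CORDIC in vectoring mode: rotates (x, y) toward the x-axis using
    only shift-and-add style updates, accumulating the rotation angle.
    Each iteration adds roughly one bit of angle accuracy."""
    angle = 0.0
    for i in range(iterations):
        step = math.atan(2.0 ** -i)   # constants stored in a small ROM in hardware
        if y > 0:                     # rotate clockwise
            x, y = x + y * 2.0 ** -i, y - x * 2.0 ** -i
            angle += step
        else:                         # rotate counter-clockwise
            x, y = x - y * 2.0 ** -i, y + x * 2.0 ** -i
            angle -= step
    return angle                      # approximates atan2(y, x) for x > 0

print(cordic_arctan(1.0, 1.0))        # ~0.7854 rad (45 degrees)
```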
1989-04-01
...existing types of data compression methods amenable to our needs: Huffman, Arithmetic, BSTW, and Lempel-Ziv. The two algorithms with the most modest...APEX architecture. Recently we began investigating various data compression algorithms with characteristics amenable to hardware implementation...This work has so far yielded a variant of the Lempel-Ziv algorithm that adapts continuously to its input and is appropriate to a hardware implementation.
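The report's adaptive Lempel-Ziv variant is not detailed in the fragment above; as a generic illustration of a dictionary coder that adapts continuously to its input, here is a minimal LZW encoder (the table layout and unbounded growth are assumptions; hardware variants bound or reset the table):

```python
def lzw_compress(data: bytes):
    """Minimal LZW: the dictionary grows with the input, so the coder
    adapts continuously to the statistics of the stream."""
    table = {bytes([i]): i for i in range(256)}
    out, cur = [], b""
    for byte in data:
        nxt = cur + bytes([byte])
        if nxt in table:
            cur = nxt                  # extend the current match
        else:
            out.append(table[cur])     # emit code for the longest match
            table[nxt] = len(table)    # learn the new phrase
            cur = bytes([byte])
    if cur:
        out.append(table[cur])
    return out

print(lzw_compress(b"abababab"))       # repeated phrases compress to few codes
```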
Huffman coding in advanced audio coding standard
NASA Astrophysics Data System (ADS)
Brzuchalski, Grzegorz
2012-05-01
This article presents several hardware architectures of the Advanced Audio Coding (AAC) Huffman noiseless encoder, its optimisations and a working implementation. Much attention has been paid to optimising the demand for hardware resources, especially memory size. The aim of the design was to produce as short a binary stream as possible within this standard. The Huffman encoder, together with the whole audio-video system, has been implemented in FPGA devices.
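For orientation, the variable-length coding idea underneath the AAC noiseless coder can be sketched as a plain Huffman table builder (AAC itself selects among predefined codebooks per section, which this sketch does not model):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table from symbol counts by repeatedly
    merging the two least frequent subtrees."""
    heap = [[n, i, {s: ""}] for i, (s, n) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    tick = len(heap)                    # unique tie-breaker for the heap
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], tick, merged])
        tick += 1
    return heap[0][2]

print(huffman_code("aaaabbc"))          # frequent symbols get shorter codewords
```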
Freiberger, Manuel; Egger, Herbert; Liebmann, Manfred; Scharfetter, Hermann
2011-11-01
Image reconstruction in fluorescence optical tomography is a three-dimensional nonlinear ill-posed problem governed by a system of partial differential equations. In this paper we demonstrate that a combination of state-of-the-art numerical algorithms and a careful hardware-optimized implementation makes it possible to solve this large-scale inverse problem in a few seconds on standard desktop PCs with modern graphics hardware. In particular, we present methods to solve not only the forward but also the nonlinear inverse problem by massively parallel programming on graphics processors. A comparison of optimized CPU and GPU implementations shows that the reconstruction can be accelerated by factors of about 15 through the use of the graphics hardware without compromising the accuracy of the reconstructed images.
Accelerating DNA analysis applications on GPU clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tumeo, Antonino; Villa, Oreste
DNA analysis is an emerging application of high-performance bioinformatics. Modern sequencing machinery is able to provide, in a few hours, large input streams of data which need to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and quickly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple-pattern matching algorithm often at the base of this application. High-performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high-performance systems also include heterogeneous processing elements, such as Graphic Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variability, depending on the size of the input streams, the number of patterns to search, and the number of matches, and poses significant challenges on current high-performance software and hardware implementations. An adequate mapping of the algorithm on the target architecture, coping with the limits of the underlying hardware, is required to reach the desired high throughputs. Load balancing also plays a crucial role when considering the limited bandwidth among the nodes of these systems. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high-performance clusters accelerated with GPUs. We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present an MPI-based implementation for a heterogeneous high-performance cluster. We compare this implementation to MPI and MPI-with-pthreads implementations for a homogeneous cluster of x86 processors, discussing the stability vs. the performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.
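A compact reference implementation of the Aho-Corasick automaton described above (goto/failure/output construction plus streaming search) helps make the mapping discussion concrete; this is a generic textbook version, not the paper's CUDA kernel:

```python
from collections import deque

def build_aho_corasick(patterns):
    """Phase 1 builds the trie; phase 2 adds failure links by BFS."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto[s][ch] = len(goto)
                goto.append({}); fail.append(0); out.append(set())
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())          # depth-1 states fail to the root
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]       # inherit matches ending here
    return goto, fail, out

def search(text, tables):
    goto, fail, out = tables
    s = 0
    for i, ch in enumerate(text):        # one state transition per character
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            yield i - len(p) + 1, p

tables = build_aho_corasick(["acg", "cgt", "gg"])
print(list(search("acgtgg", tables)))    # [(0, 'acg'), (1, 'cgt'), (4, 'gg')]
```

The single-pass, branch-light inner loop is what makes the algorithm attractive for GPU and hardware mapping, while the irregular failure transitions cause the performance variability the abstract mentions.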
47 CFR 400.7 - Eligible uses for grant funds.
Code of Federal Regulations, 2012 CFR
2012-10-01
... the acquisition and deployment of hardware and software that enables the implementation and operation of Phase II E-911 services, for the acquisition and deployment of hardware and software to enable the migration to an IP-enabled emergency network, for the training in the use of such hardware and software, or...
NASA Technical Reports Server (NTRS)
Schoenwald, Adam J.; Bradley, Damon C.; Mohammed, Priscilla N.; Piepmeier, Jeffrey R.; Wong, Mark
2016-01-01
Radio-frequency interference (RFI) is a known problem for passive remote sensing as evidenced in the L-band radiometers SMOS, Aquarius and more recently, SMAP. Various algorithms have been developed and implemented on SMAP to improve science measurements. This was achieved by the use of a digital microwave radiometer. RFI mitigation becomes more challenging for microwave radiometers operating at higher frequencies in shared allocations. At higher frequencies larger bandwidths are also desirable for lower measurement noise further adding to processing challenges. This work focuses on finding improved RFI mitigation techniques that will be effective at additional frequencies and at higher bandwidths. To aid the development and testing of applicable detection and mitigation techniques, a wide-band RFI algorithm testing environment has been developed using the Reconfigurable Open Architecture Computing Hardware System (ROACH) built by the Collaboration for Astronomy Signal Processing and Electronics Research (CASPER) Group. The testing environment also consists of various test equipment used to reproduce typical signals that a radiometer may see including those with and without RFI. The testing environment permits quick evaluations of RFI mitigation algorithms as well as show that they are implementable in hardware. The algorithm implemented is a complex signal kurtosis detector which was modeled and simulated. The complex signal kurtosis detector showed improved performance over the real kurtosis detector under certain conditions. The real kurtosis is implemented on SMAP at 24 MHz bandwidth. The complex signal kurtosis algorithm was then implemented in hardware at 200 MHz bandwidth using the ROACH. In this work, performance of the complex signal kurtosis and the real signal kurtosis are compared. Performance evaluations and comparisons in both simulation as well as experimental hardware implementations were done with the use of receiver operating characteristic (ROC) curves. The complex kurtosis algorithm has the potential to reduce data rate due to onboard processing in addition to improving RFI detection performance.
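The kurtosis statistic at the heart of such detectors can be illustrated in a few lines; the normalized fourth-moment form and the threshold-around-2 rule below are one common formulation for complex Gaussian baseband data, not necessarily the exact SMAP definition:

```python
import numpy as np

def complex_kurtosis(iq):
    """Normalized fourth moment of complex baseband samples: ~2 for
    circular Gaussian noise, pushed away from 2 by pulsed or CW RFI,
    so a two-sided threshold around 2 flags contaminated blocks."""
    p = np.abs(iq) ** 2
    return np.mean(p ** 2) / np.mean(p) ** 2

rng = np.random.default_rng(0)
noise = (rng.standard_normal(4096) + 1j * rng.standard_normal(4096)) / np.sqrt(2)
rfi = noise.copy()
rfi[:200] += 3.0 * np.exp(2j * np.pi * 0.1 * np.arange(200))  # pulsed tone
print(complex_kurtosis(noise), complex_kurtosis(rfi))  # ~2.0 vs. clearly offset
```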
NASA Technical Reports Server (NTRS)
Pang, Jackson; Liddicoat, Albert; Ralston, Jesse; Pingree, Paula
2006-01-01
The current implementation of the Telecommunications Protocol Processing Subsystem Using Reconfigurable Interoperable Gate Arrays (TRIGA) is equipped with CFDP protocol and CCSDS Telemetry and Telecommand framing schemes to replace the CPU intensive software counterpart implementation for reliable deep space communication. We present the hardware/software co-design methodology used to accomplish high data rate throughput. The hardware CFDP protocol stack implementation is then compared against the two recent flight implementations. The results from our experiments show that TRIGA offers more than 3 orders of magnitude throughput improvement with less than one-tenth of the power consumption.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumar, Dinesh; Thapliyal, Himanshu; Mohammad, Azhar
Differential Power Analysis (DPA) is considered a main threat when designing cryptographic processors. In cryptographic algorithms like DES and AES, the S-Box is used to obscure the relationship between the keys and the cipher texts. However, the S-Box is prone to DPA attacks due to its high power consumption. In this paper, we implement an energy-efficient 8-bit S-Box circuit using our proposed Symmetric Pass Gate Adiabatic Logic (SPGAL). SPGAL is energy-efficient compared to the existing DPA-resistant adiabatic and non-adiabatic logic families, owing to the reduction of non-adiabatic loss during the evaluate phase of the outputs. Further, the S-Box circuit implemented using SPGAL is resistant to DPA attacks. The results are verified through SPICE simulations in 180 nm technology. SPICE simulations show that the SPGAL-based S-Box circuit saves up to 92% and 67% of energy compared to the conventional CMOS and Secured Quasi-Adiabatic Logic (SQAL) based S-Box circuits, respectively. From the simulation results, it is evident that SPGAL-based circuits are energy-efficient compared to the existing DPA-resistant adiabatic and non-adiabatic logic families. In a nutshell, SPGAL-based gates can be used to build secure hardware for low-power portable electronic devices and Internet-of-Things (IoT) based electronic devices.
Low complexity 1D IDCT for 16-bit parallel architectures
NASA Astrophysics Data System (ADS)
Bivolarski, Lazar
2007-09-01
This paper shows that using the Loeffler, Ligtenberg, and Moschytz factorization of the 8-point one-dimensional (1-D) IDCT [2] as a fast approximation of the Discrete Cosine Transform (DCT), and using only 16-bit numbers, it is possible to create an IEEE 1180-1990 compliant, multiplierless algorithm with low computational complexity. Owing to its structure, this algorithm can be implemented efficiently on parallel high-performance architectures, and its low complexity also makes it sufficient for a wide range of other architectures. An additional constraint on this work was the requirement of compliance with the existing MPEG standards. Hardware implementation complexity and low resource usage were also part of the design criteria for this algorithm. The implementation is also compliant with the precision requirements described in the MPEG IDCT precision specification ISO/IEC 23002-1. Complexity analysis is performed as an extension to the simple measure of shifts and adds for the multiplierless algorithm, as additional operations are included in the complexity measure to better describe the actual transform implementation complexity.
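The multiplierless idea can be illustrated with a single rotation constant: a fixed-point multiply by 1/sqrt(2) decomposed into shifts and adds (the Q15 scaling is an assumption; the full Loeffler factorization wires several such constants into butterflies):

```python
def mul_invsqrt2_q15(x: int) -> int:
    """Multiplierless scaling by 1/sqrt(2) ~ 23170/2^15 in 16-bit fixed
    point: the constant is decomposed into powers of two, so only shifts
    and adds are needed (23170 = 2^14 + 2^12 + 2^11 + 2^9 + 2^7 + 2^1)."""
    acc = (x << 14) + (x << 12) + (x << 11) + (x << 9) + (x << 7) + (x << 1)
    return acc >> 15

print(mul_invsqrt2_q15(1000))  # 707, i.e. 1000/sqrt(2) rounded down
```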
Hierarchical MFMO Circuit Modules for an Energy-Efficient SDR DBF
NASA Astrophysics Data System (ADS)
Mar, Jeich; Kuo, Chi-Cheng; Wu, Shin-Ru; Lin, You-Rong
The hierarchical multi-function matrix operation (MFMO) circuit modules are designed using the coordinate rotation digital computer (CORDIC) algorithm to realize the intensive computation of matrix operations. The paper emphasizes that the designed hierarchical MFMO circuit modules can be used to develop a power-efficient software-defined radio (SDR) digital beamformer (DBF). The formulas for the processing time of the scalable MFMO circuit modules implemented in a field programmable gate array (FPGA) are derived to allocate the proper logic resources for the hardware reconfiguration. The hierarchical MFMO circuit modules are scalable to the changing number of array branches employed by the SDR DBF to achieve the purpose of power saving. The efficient reuse of the common MFMO circuit modules in the SDR DBF can also lead to energy reduction. Finally, the power dissipation and reconfiguration function in the different modes of the SDR DBF are observed from the experimental results.
Development of noSQL data storage for the ATLAS PanDA Monitoring System
NASA Astrophysics Data System (ADS)
Potekhin, M.; ATLAS Collaboration
2012-06-01
For several years the PanDA Workload Management System has been the basis for distributed production and analysis for the ATLAS experiment at the LHC. Since the start of data taking PanDA usage has ramped up steadily, typically exceeding 500k completed jobs/day by June 2011. The associated monitoring data volume has been rising as well, to levels that present a new set of challenges in the areas of database scalability and monitoring system performance and efficiency. These challenges are being met with a R&D effort aimed at implementing a scalable and efficient monitoring data storage based on a noSQL solution (Cassandra). We present our motivations for using this technology, as well as data design and the techniques used for efficient indexing of the data. We also discuss the hardware requirements as they were determined by testing with actual data and realistic loads.
Toward a more efficient and scalable checkpoint/restart mechanism in the Community Atmosphere Model
NASA Astrophysics Data System (ADS)
Anantharaj, Valentine
2015-04-01
The number of cores (both CPU as well as accelerator) in large-scale systems has been increasing rapidly over the past several years. In 2008, there were only 5 systems in the Top500 list that had over 100,000 total cores (including accelerator cores), whereas the number of systems with such capability has jumped to 31 in Nov 2014. This growth however has also increased the risk of hardware failure rates, necessitating the implementation of fault tolerance mechanisms in applications. The checkpoint and restart (C/R) approach is commonly used to save the state of the application and restart at a later time, either after failure or to continue execution of experiments. The implementation of an efficient C/R mechanism will make it more affordable to output the necessary C/R files more frequently. The availability of larger systems (more nodes, memory and cores) has also facilitated the scaling of applications. Nowadays, it is more common to conduct coupled global climate simulation experiments at 1 deg horizontal resolution (atmosphere), often requiring about 10^3 cores. At the same time, a few climate modeling teams that have access to a dedicated cluster and/or large-scale systems are involved in modeling experiments at 0.25 deg horizontal resolution (atmosphere) and 0.1 deg resolution for the ocean. These ultrascale configurations require on the order of 10^4 to 10^5 cores. It is not only necessary for the numerical algorithms to scale efficiently, but the input/output (IO) mechanism must also scale accordingly. An ongoing series of ultrascale climate simulations, using the Titan supercomputer at the Oak Ridge Leadership Computing Facility (ORNL), is based on the spectral element dynamical core of the Community Atmosphere Model (CAM-SE), which is a component of the Community Earth System Model and the DOE Accelerated Climate Model for Energy (ACME). The CAM-SE dynamical core for a 0.25 deg configuration has been shown to scale efficiently across 100,000 CPU cores. At this scale, there is an increased risk that the simulation could be terminated due to hardware failures, resulting in a loss that could be as high as 10^5 - 10^6 Titan core hours. Increasing the frequency of the output of C/R files could mitigate this loss, but at the cost of additional C/R overhead. We are testing a more efficient C/R mechanism in CAM-SE. Our early implementation has demonstrated a nearly 3X performance improvement for a 1 deg CAM-SE (with CAM5 physics and MOZART chemistry) configuration using nearly 10^3 cores. We are in the process of scaling our implementation to 10^5 cores. This would allow us to run ultrascale simulations with more sophisticated physics and chemistry options while making better utilization of resources.
A New Method for Control of the Efficiency of Gear Reducers
NASA Astrophysics Data System (ADS)
E Kozlov, K.; Egorov, A. V.; Belogusev, V. N.
2017-04-01
This article proposes a new method to control the energy efficiency of gear reducers. The method allows evaluating the friction losses in a drive motor, drive motor bearing assemblies, and toothing, both at the stage of control of the finished product and in the course of its operation, maintenance, and repair. The proposed method, unlike currently used methods for control of the efficiency of gear reducers, allows determining the friction losses without the use of strain measurement, which requires calibration of tensometric sensors and expensive equipment. The method is based on the idea of invariability of the mechanical characteristics of the induction motor at constant voltage, winding resistance, and mains frequency, regardless of the driven inertia mass. This paper presents experimental results which verify the theoretical predictions. The proposed method can be implemented in the acceptance test procedure at companies that manufacture gear reducers, thereby assessing their effectiveness and the level of degradation processes that significantly affect the service life of the research object. The method can be implemented with either universal or specialized hardware and software complexes. Both an increment of the moment of inertia and the acceleration time of a gear reducer may serve as performance criteria.
Learning in Neural Networks: VLSI Implementation Strategies
NASA Technical Reports Server (NTRS)
Duong, Tuan Anh
1995-01-01
Fully-parallel hardware neural network implementations may be applied to high-speed recognition, classification, and mapping tasks in areas such as vision, or can be used as low-cost self-contained units for tasks such as error detection in mechanical systems (e.g. autos). Learning is required not only to satisfy application requirements, but also to overcome hardware-imposed limitations such as reduced dynamic range of connections.
Tunable, Flexible, and Efficient Optimization of Control Pulses for Practical Qubits
NASA Astrophysics Data System (ADS)
Machnes, Shai; Assémat, Elie; Tannor, David; Wilhelm, Frank K.
2018-04-01
Quantum computation places very stringent demands on gate fidelities, and experimental implementations require both the controls and the resultant dynamics to conform to hardware-specific constraints. Superconducting qubits present the additional requirement that pulses must have simple parameterizations, so they can be further calibrated in the experiment, to compensate for uncertainties in system parameters. Other quantum technologies, such as sensing, require extremely high fidelities. We present a novel, conceptually simple and easy-to-implement gradient-based optimal control technique named gradient optimization of analytic controls (GOAT), which satisfies all the above requirements, unlike previous approaches. To demonstrate GOAT's capabilities, with emphasis on flexibility and ease of subsequent calibration, we optimize fast coherence-limited pulses for two leading superconducting qubits architectures—flux-tunable transmons and fixed-frequency transmons with tunable couplers.
Systolic array processing of the sequential decoding algorithm
NASA Technical Reports Server (NTRS)
Chang, C. Y.; Yao, K.
1989-01-01
A systolic array processing technique is applied to implementing the stack algorithm form of the sequential decoding algorithm. It is shown that sorting, a key function in the stack algorithm, can be efficiently realized by a special type of systolic arrays known as systolic priority queues. Compared to the stack-bucket algorithm, this approach is shown to have the advantages that the decoding always moves along the optimal path, that it has a fast and constant decoding speed and that its simple and regular hardware architecture is suitable for VLSI implementation. Three types of systolic priority queues are discussed: random access scheme, shift register scheme and ripple register scheme. The property of the entries stored in the systolic priority queue is also investigated. The results are applicable to many other basic sorting type problems.
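A behavioral model of a systolic priority queue conveys the idea: each cell compares and conditionally swaps with its neighbor, which hardware does in parallel every clock cycle (this sequential Python model assumes a min-at-head ordering; a stack decoder would keep the best path metric at the head):

```python
class SystolicPriorityQueue:
    """Behavioral model of a systolic priority queue: a linear array of
    cells that keeps the best entry at the head. In hardware, every
    cell's compare-and-swap happens in parallel in one clock; this loop
    models the same data movement sequentially."""
    def __init__(self):
        self.cells = []

    def insert(self, metric):
        # new entry enters at the head and ripples past larger entries
        for i in range(len(self.cells)):
            if metric < self.cells[i]:
                self.cells[i], metric = metric, self.cells[i]
        self.cells.append(metric)

    def extract_best(self):
        return self.cells.pop(0)   # head always holds the best entry

q = SystolicPriorityQueue()
for m in [7, 3, 9, 1]:
    q.insert(m)
print(q.extract_best(), q.extract_best())  # 1 3
```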
FPGA Implementation of Generalized Hebbian Algorithm for Texture Classification
Lin, Shiow-Jyu; Hwang, Wen-Jyi; Lee, Wei-Hao
2012-01-01
This paper presents a novel hardware architecture for principal component analysis. The architecture is based on the Generalized Hebbian Algorithm (GHA) because of its simplicity and effectiveness. The architecture is separated into three portions: the weight vector updating unit, the principal computation unit and the memory unit. In the weight vector updating unit, the computation of different synaptic weight vectors shares the same circuit for reducing the area costs. To show the effectiveness of the circuit, a texture classification system based on the proposed architecture is physically implemented by Field Programmable Gate Array (FPGA). It is embedded in a System-On-Programmable-Chip (SOPC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient design for attaining both high speed performance and low area costs. PMID:22778640
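The GHA update realized by the weight vector updating unit is Sanger's rule; a minimal software version (learning rate and epoch count are arbitrary choices) is:

```python
import numpy as np

def gha(X, n_components=3, lr=1e-3, epochs=10, seed=0):
    """Generalized Hebbian Algorithm (Sanger's rule):
    dW = lr * (y x^T - LT(y y^T) W), where LT keeps the lower triangle.
    Rows of W converge toward the leading principal components of X."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_components, X.shape[1])) * 0.01
    for _ in range(epochs):
        for x in X:                       # one sample at a time, as in hardware
            y = W @ x
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

X = np.random.default_rng(1).standard_normal((500, 8))
X -= X.mean(axis=0)                       # GHA assumes zero-mean data
print(gha(X).shape)                       # (3, 8): approximate principal directions
```

The lower-triangular term is what lets all synaptic weight vectors share one update circuit, as the architecture above exploits.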
(abstract) A High Throughput 3-D Inner Product Processor
NASA Technical Reports Server (NTRS)
Daud, Tuan
1996-01-01
A particularly challenging image processing application is real-time scene acquisition and object discrimination. It requires spatio-temporal recognition of point and resolved objects at high speeds with parallel processing algorithms. Neural network paradigms provide fine-grain parallelism and, when implemented in hardware, offer orders of magnitude speed-up. However, neural networks implemented on a VLSI chip are planar architectures capable of efficient processing of linear vector signals rather than 2-D images. Therefore, for processing of images, a 3-D stack of neural-net ICs receiving planar inputs and consuming minimal power is required. Details of the circuits and chip architectures will be described, along with the need to develop ultralow-power electronics. Further, use of the architecture in a system for high-speed processing will be illustrated.
ACE: Automatic Centroid Extractor for real time target tracking
NASA Technical Reports Server (NTRS)
Cameron, K.; Whitaker, S.; Canaris, J.
1990-01-01
A high performance video image processor has been implemented which is capable of grouping contiguous pixels from a raster scan image into objects and then calculating centroid information for each object in a frame. The algorithm employed to group pixels is very efficient and is guaranteed to work properly for all convex shapes as well as most concave shapes. Processing speeds are adequate for real-time processing of video images having a pixel rate of up to 20 million pixels per second. Pixels may be up to 8 bits wide. The processor is designed to interface directly to a transputer serial link communications channel with no additional hardware. The full custom VLSI processor was implemented in a 1.6 μm CMOS process and measures 7200 μm on a side.
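In software, the same group-then-centroid flow can be expressed with connected-component labeling; the sketch below uses SciPy and a hypothetical threshold, and does not model the chip's raster-scan grouping logic:

```python
import numpy as np
from scipy import ndimage

def object_centroids(frame, threshold=128):
    """Group contiguous above-threshold pixels into objects and compute
    per-object intensity-weighted centroids."""
    mask = frame >= threshold
    labels, count = ndimage.label(mask)          # connected components
    return ndimage.center_of_mass(frame, labels, range(1, count + 1))

frame = np.zeros((16, 16), dtype=np.uint8)
frame[2:5, 2:5] = 200                            # first bright blob
frame[10:12, 8:14] = 150                         # second bright blob
print(object_centroids(frame))                   # [(3.0, 3.0), (10.5, 10.5)]
```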
Lattice surgery on the Raussendorf lattice
NASA Astrophysics Data System (ADS)
Herr, Daniel; Paler, Alexandru; Devitt, Simon J.; Nori, Franco
2018-07-01
Lattice surgery is a method to perform quantum computation fault-tolerantly by using operations on boundary qubits between different patches of the planar code. This technique allows for universal planar code computation without eliminating the intrinsic two-dimensional nearest-neighbor properties of the surface code that eases physical hardware implementations. Lattice surgery approaches to algorithmic compilation and optimization have been demonstrated to be more resource efficient for resource-intensive components of a fault-tolerant algorithm, and consequently may be preferable over braid-based logic. Lattice surgery can be extended to the Raussendorf lattice, providing a measurement-based approach to the surface code. In this paper we describe how lattice surgery can be performed on the Raussendorf lattice and therefore give a viable alternative to computation using braiding in measurement-based implementations of topological codes.
Hardware Evolution of Control Electronics
NASA Technical Reports Server (NTRS)
Gwaltney, David; Steincamp, Jim; Corder, Eric; King, Ken; Ferguson, M. I.; Dutton, Ken
2003-01-01
The evolution of closed-loop motor speed controllers implemented on the JPL FPTA2 is presented. The response of evolved controllers to sinusoidal commands, controller reconfiguration for fault tolerance, and hardware evolution are described.
Web survey data collection and retrieval to plan teleradiology implementation
NASA Astrophysics Data System (ADS)
Alaoui, Adil; Collmann, Jeff R.; Johnson, Jeffrey A.; Lindisch, David; Nguyen, Dan; Mun, Seong K.
2003-05-01
This case study details the experience of system engineers of the Imaging Science and Information Systems Center, Georgetown University Medical Center (ISIS) and radiologists from the Department of Radiology in the implementation of a new teleradiology system. The teleradiology system enables radiologists to view medical images from remote sites under those circumstances where a resident radiologist needs assistance in evaluating the images after hours and during weekends; it also enables clinicians to access patients' medical images from different workstations within the hospital. The implementation of the teleradiology project was preceded by an evaluation phase to perform testing, gather user feedback using a web site, and collect information that helped eliminate system bugs, complete recommendations regarding minimum hardware configuration and bandwidth, and enhance the system's functions. This phase included a survey-based system assessment of computer configurations, Internet connections, problem identification, and recommendations for improvement, and a testing period with two radiologists and ISIS engineers. The second phase was designed to launch the system and make it available to all attending radiologists in the department. To accomplish the first phase of the project, a web site was designed and ASP pages were created to enable users to securely log on and enter feedback and recommendations into an SQL database. This efficient, accurate data flow alleviated networking, software and hardware problems. Corrective recommendations were immediately forwarded to the software vendor. The vendor responded with software updates that better met the needs of the radiologists. The ISIS Center completed recommendations for minimum hardware and bandwidth requirements. This experience illustrates that the approach used in collecting the data and facilitating the teamwork between the system engineers and radiologists was instrumental in the project's success. Major problems with the teleradiology system were discovered and remedied early by linking the actual practice experience of the physicians to the system improvements.
Rapid algorithm prototyping and implementation for power quality measurement
NASA Astrophysics Data System (ADS)
Kołek, Krzysztof; Piątek, Krzysztof
2015-12-01
This article presents a Model-Based Design (MBD) approach to rapidly implement power quality (PQ) metering algorithms. Power supply quality is a very important aspect of modern power systems and will become even more important in future smart grids. In this case, maintaining the PQ parameters at the desired level will require efficient implementation methods for the metering algorithms. Currently, the development of new, advanced PQ metering algorithms requires new hardware with adequate computational capability and time-intensive, cost-ineffective manual implementations. An alternative, considered here, is an MBD approach. The MBD approach focuses on the modelling and validation of the model by simulation, which is well-supported by Computer-Aided Engineering (CAE) packages. This paper presents two algorithms utilized in modern PQ meters: a phase-locked loop based on an Enhanced Phase Locked Loop (EPLL), and the flicker measurement according to the IEC 61000-4-15 standard. The algorithms were chosen because of their complexity and non-trivial development. They were first modelled in the MATLAB/Simulink package, then tested and validated in a simulation environment. The models, in the form of Simulink diagrams, were next used to automatically generate C code. The code was compiled and executed in real-time on the Zynq Xilinx platform that combines a reconfigurable Field Programmable Gate Array (FPGA) with a dual-core processor. The MBD development of PQ algorithms, automatic code generation, and compilation form a rapid algorithm prototyping and implementation path for PQ measurements. The main advantage of this approach is the ability to focus on the design, validation, and testing stages while skipping over implementation issues. The code generation process renders production-ready code that can be easily used on the target hardware. This is especially important when standards for PQ measurement are in constant development, and the PQ issues in emerging smart grids will require tools for rapid development and implementation of such algorithms.
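As a flavor of the first algorithm, one common textbook formulation of the EPLL state updates (the gains and sample rate below are illustrative, not the authors' tuned values) is:

```python
import numpy as np

def epll(u, fs, f0=50.0, k1=200.0, k2=4000.0, k3=80.0):
    """One common formulation of the Enhanced PLL: jointly tracks the
    amplitude A, angular frequency w, and phase phi of the fundamental
    of u(t). Gains k1..k3 set the estimation bandwidth."""
    dt = 1.0 / fs
    A, w, phi = 1.0, 2 * np.pi * f0, 0.0
    y = np.zeros_like(u)
    for n in range(len(u)):
        e = u[n] - A * np.sin(phi)        # tracking error
        A += dt * k1 * e * np.sin(phi)    # amplitude adaptation
        w += dt * k2 * e * np.cos(phi)    # frequency adaptation
        phi += dt * (w + k3 * e * np.cos(phi))
        y[n] = A * np.sin(phi)
    return y

fs = 10_000
t = np.arange(fs) / fs
u = 1.2 * np.sin(2 * np.pi * 50.2 * t)    # slightly off-nominal mains tone
y = epll(u, fs)                            # converges to track u
```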
Selected Systems Engineering Process Deficiencies and Their Consequences
NASA Technical Reports Server (NTRS)
Thomas, Lawrence Dale
2006-01-01
The systems engineering process is well established and well understood. While this statement could be argued in the light of the many systems engineering guidelines that have been developed, comparative review of these respective descriptions reveals that they differ primarily in the number of discrete steps or other nuances, and are at their core essentially common. Likewise, the systems engineering textbooks differ primarily in the context for application of systems engineering or in the utilization of evolved tools and techniques, not in the basic method. Thus, failures in systems engineering cannot credibly be attributed to implementation of the wrong systems engineering process among alternatives. However, numerous system failures can be attributed to deficient implementation of the systems engineering process. What may clearly be perceived as a systems engineering deficiency in retrospect can appear to be a well-considered systems engineering efficiency in real time - an efficiency taken to reduce cost or meet a schedule, or more often both. Typically these efficiencies are grounded on apparently solid rationale, such as reuse of heritage hardware or software. Over time, unintended consequences of a systems engineering process deficiency may begin to be realized, and unfortunately often the consequence is system failure. This paper describes several actual cases of system failures that resulted from deficiencies in their systems engineering process implementation, including the Ariane 5 and the Hubble Space Telescope.
Selected systems engineering process deficiencies and their consequences
NASA Astrophysics Data System (ADS)
Thomas, L. Dale
2007-06-01
The systems engineering process is well established and well understood. While this statement could be argued in the light of the many systems engineering guidelines that have been developed, comparative review of these respective descriptions reveals that they differ primarily in the number of discrete steps or other nuances, and are at their core essentially common. Likewise, the systems engineering textbooks differ primarily in the context for application of systems engineering or in the utilization of evolved tools and techniques, not in the basic method. Thus, failures in systems engineering cannot credibly be attributed to implementation of the wrong systems engineering process among alternatives. However, numerous system failures can be attributed to deficient implementation of the systems engineering process. What may clearly be perceived as a systems engineering deficiency in retrospect can appear to be a well-considered systems engineering efficiency in real time—an efficiency taken to reduce cost or meet a schedule, or more often both. Typically these efficiencies are grounded on apparently solid rationale, such as reuse of heritage hardware or software. Over time, unintended consequences of a systems engineering process deficiency may begin to be realized, and unfortunately often the consequence is system failure. This paper describes several actual cases of system failures that resulted from deficiencies in their systems engineering process implementation, including the Ariane 5 and the Hubble Space Telescope.
Effects of pulse width and coding on radar returns from clear air
NASA Technical Reports Server (NTRS)
Cornish, C. R.
1983-01-01
In atmospheric radar studies it is desirable to obtain maximum information about the atmosphere while using the radar transmitter and processing hardware efficiently. Large pulse widths are used to increase the signal-to-noise ratio, since clear air returns are generally weak and maximum height coverage is desired. Yet since good height resolution is equally important, pulse compression techniques such as phase coding are employed to optimize the average power of the transmitter. Considerations in implementing a coding scheme and the subsequent effects of an impinging pulse on the atmosphere are investigated.
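A short example shows the pulse-compression payoff of binary phase coding: correlating an echo against the Barker-13 code collapses the long coded pulse into a narrow peak with low sidelobes, recovering height resolution without sacrificing average power:

```python
import numpy as np

# Barker-13 is a classic binary phase code whose autocorrelation has a
# peak of 13 and sidelobes of magnitude at most 1 (-22.3 dB peak sidelobe).
barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)

echo = np.concatenate([np.zeros(20), barker13, np.zeros(20)])  # ideal return
compressed = np.correlate(echo, barker13, mode="same")         # matched filter
print(compressed.max(), np.sort(np.abs(compressed))[-2])       # 13.0 vs. 1.0
```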
Electrically reconfigurable logic array
NASA Technical Reports Server (NTRS)
Agarwal, R. K.
1982-01-01
To compose complicated systems using algorithmically specialized logic circuits or processors, one solution is to perform relational computations such as union, division and intersection directly in hardware. These relations can be pipelined efficiently on a network of processors having an array configuration. These processors can be designed and implemented with a few simple cells. In order to determine the state of the art in Electrically Reconfigurable Logic Arrays (ERLA), a survey of the available programmable logic arrays (PLA) and the logic circuit elements used in such arrays was conducted. Based on this survey, some recommendations are made for ERLA devices.
Advances in Projection Moire Interferometry Development for Large Wind Tunnel Applications
NASA Technical Reports Server (NTRS)
Fleming, Gary A.; Soto, Hector L.; South, Bruce W.; Bartram, Scott M.
1999-01-01
An instrument development program aimed at using Projection Moire Interferometry (PMI) for acquiring model deformation measurements in large wind tunnels was begun at NASA Langley Research Center in 1996. Various improvements to the initial prototype PMI systems have been made throughout this development effort. This paper documents several of the most significant improvements to the optical hardware and image processing software, and addresses system implementation issues for large wind tunnel applications. The improvements have increased both measurement accuracy and instrument efficiency, promoting the routine use of PMI for model deformation measurements in production wind tunnel tests.
Generating higher-order quantum dissipation from lower-order parametric processes
NASA Astrophysics Data System (ADS)
Mundhada, S. O.; Grimm, A.; Touzard, S.; Vool, U.; Shankar, S.; Devoret, M. H.; Mirrahimi, M.
2017-06-01
The stabilisation of quantum manifolds is at the heart of error-protected quantum information storage and manipulation. Nonlinear driven-dissipative processes achieve such stabilisation in a hardware efficient manner. Josephson circuits with parametric pump drives implement these nonlinear interactions. In this article, we propose a scheme to engineer a four-photon drive and dissipation on a harmonic oscillator by cascading experimentally demonstrated two-photon processes. This would stabilise a four-dimensional degenerate manifold in a superconducting resonator. We analyse the performance of the scheme using numerical simulations of a realisable system with experimentally achievable parameters.
Biamonte, Jacob; Wittek, Peter; Pancotti, Nicola; Rebentrost, Patrick; Wiebe, Nathan; Lloyd, Seth
2017-09-13
Fuelled by increasing computer power and algorithmic advances, machine learning techniques have become powerful tools for finding patterns in data. Quantum systems produce atypical patterns that classical systems are thought not to produce efficiently, so it is reasonable to postulate that quantum computers may outperform classical computers on machine learning tasks. The field of quantum machine learning explores how to devise and implement quantum software that could enable machine learning that is faster than that of classical computers. Recent work has produced quantum algorithms that could act as the building blocks of machine learning programs, but the hardware and software challenges are still considerable.
Planning for Materials Processing in Space
NASA Technical Reports Server (NTRS)
1977-01-01
A systems design study to describe the conceptual evolution, the institutional interrelationships, and the basic physical requirements to implement materials processing in space was conducted. Planning for a processing era, rather than hardware design, was emphasized. Product development in space was examined in terms of fluid phenomena, phase separation, and heat and mass transfer. The effect of materials processing on the environment was studied. A concept for modular, unmanned orbiting facilities using the modified external tank of the space shuttle is presented. Organizational and funding structures which would provide for the efficient movement of materials from user to space are discussed.
NASA Astrophysics Data System (ADS)
Biamonte, Jacob; Wittek, Peter; Pancotti, Nicola; Rebentrost, Patrick; Wiebe, Nathan; Lloyd, Seth
2017-09-01
Fuelled by increasing computer power and algorithmic advances, machine learning techniques have become powerful tools for finding patterns in data. Quantum systems produce atypical patterns that classical systems are thought not to produce efficiently, so it is reasonable to postulate that quantum computers may outperform classical computers on machine learning tasks. The field of quantum machine learning explores how to devise and implement quantum software that could enable machine learning that is faster than that of classical computers. Recent work has produced quantum algorithms that could act as the building blocks of machine learning programs, but the hardware and software challenges are still considerable.
Srinivasa, Narayan; Zhang, Deying; Grigorian, Beayna
2014-03-01
This paper describes a novel architecture for enabling robust and efficient neuromorphic communication. The architecture combines two concepts: 1) synaptic time multiplexing (STM), which trades space for speed of processing to create an intragroup communication approach that is firing-rate independent and offers more flexibility in connectivity than cross-bar architectures, and 2) wired multiple-input multiple-output (MIMO) communication with orthogonal frequency division multiplexing (OFDM) techniques to enable robust and efficient intergroup communication for neuromorphic systems. The MIMO-OFDM concept for the proposed architecture was analyzed by simulating a large-scale spiking neural network architecture. Analysis shows that the neuromorphic system with MIMO-OFDM exhibits robust and efficient communication while operating in real time with a high bit rate. Through combining STM with MIMO-OFDM techniques, the resulting system offers flexible and scalable connectivity as well as a power- and area-efficient solution for the implementation of very large-scale spiking neural architectures in hardware.
Convolutional networks for fast, energy-efficient neuromorphic computing.
Esser, Steven K; Merolla, Paul A; Arthur, John V; Cassidy, Andrew S; Appuswamy, Rathinakumar; Andreopoulos, Alexander; Berg, David J; McKinstry, Jeffrey L; Melano, Timothy; Barch, Davis R; di Nolfo, Carmelo; Datta, Pallab; Amir, Arnon; Taba, Brian; Flickner, Myron D; Modha, Dharmendra S
2016-10-11
Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware's underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.
Efficient space-time sampling with pixel-wise coded exposure for high-speed imaging.
Liu, Dengyu; Gu, Jinwei; Hitomi, Yasunobu; Gupta, Mohit; Mitsunaga, Tomoo; Nayar, Shree K
2014-02-01
Cameras face a fundamental trade-off between spatial and temporal resolution. Digital still cameras can capture images with high spatial resolution, but most high-speed video cameras have relatively low spatial resolution. It is hard to overcome this trade-off without incurring a significant increase in hardware costs. In this paper, we propose techniques for sampling, representing, and reconstructing the space-time volume to overcome this trade-off. Our approach has two important distinctions compared to previous works: 1) We achieve sparse representation of videos by learning an overcomplete dictionary on video patches, and 2) we adhere to practical hardware constraints on sampling schemes imposed by architectures of current image sensors, which means that our sampling function can be implemented on CMOS image sensors with modified control units in the future. We evaluate components of our approach, sampling function and sparse representation, by comparing them to several existing approaches. We also implement a prototype imaging system with pixel-wise coded exposure control using a liquid crystal on silicon device. System characteristics such as field of view and modulation transfer function are evaluated for our imaging system. Both simulations and experiments on a wide range of scenes show that our method can effectively reconstruct a video from a single coded image while maintaining high spatial resolution.
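The sampling side of this approach can be simulated directly; the sketch below uses a random per-sub-frame exposure mask, a simplification of the paper's per-pixel coded exposures, and omits the dictionary-based reconstruction:

```python
import numpy as np

def coded_exposure_capture(video, duty=0.25, seed=0):
    """Simulate pixel-wise coded exposure: each pixel integrates light
    only during its own random subset of the T sub-frames, so one
    captured image encodes samples of the whole space-time volume."""
    T, H, W = video.shape
    rng = np.random.default_rng(seed)
    mask = rng.random((T, H, W)) < duty        # per-pixel shutter pattern
    coded = (video * mask).sum(axis=0)         # single coded image
    return coded, mask                         # mask is needed for recovery

video = np.random.default_rng(1).random((9, 64, 64))
coded, mask = coded_exposure_capture(video)
print(coded.shape, mask.mean().round(2))       # (64, 64) ~0.25
```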
An FPGA Platform for Real-Time Simulation of Spiking Neuronal Networks
Pani, Danilo; Meloni, Paolo; Tuveri, Giuseppe; Palumbo, Francesca; Massobrio, Paolo; Raffo, Luigi
2017-01-01
In recent years, the idea of dynamically interfacing biological neurons with artificial ones has become more and more pressing. The reason is essentially the design of innovative neuroprostheses, in which biological cell assemblies of the brain can be substituted by artificial ones. For closed-loop experiments with biological neuronal networks interfaced with in silico modeled networks, several technological challenges need to be faced, from the low-level interfacing between the living tissue and the computational model to the implementation of the latter in a suitable form for real-time processing. Field programmable gate arrays (FPGAs) can improve flexibility when simple neuronal models are required, obtaining good accuracy, real-time performance, and the possibility to create a hybrid system without any custom hardware, just programming the hardware to achieve the required functionality. In this paper, this possibility is explored presenting a modular and efficient FPGA design of an in silico spiking neural network exploiting the Izhikevich model. The proposed system, prototypically implemented on a Xilinx Virtex 6 device, is able to simulate a fully connected network counting up to 1,440 neurons, in real time, at a sampling rate of 10 kHz, which is reasonable for small to medium scale extra-cellular closed-loop experiments. PMID:28293163
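The Izhikevich model that each FPGA neuron evaluates is two coupled update equations plus a reset; a reference software version (regular-spiking parameters, Euler step of 0.1 ms) is:

```python
import numpy as np

def izhikevich(I, dt=0.1, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Izhikevich neuron, Euler integration:
    v' = 0.04 v^2 + 5 v + 140 - u + I,  u' = a (b v - u);
    on a spike (v >= 30 mV): v <- c, u <- u + d."""
    v, u = c, b * c
    spikes, vs = [], []
    for n, i_n in enumerate(I):
        v += dt * (0.04 * v * v + 5 * v + 140 - u + i_n)
        u += dt * a * (b * v - u)
        if v >= 30.0:                 # spike detected: record and reset
            spikes.append(n)
            v, u = c, u + d
        vs.append(v)
    return np.array(vs), spikes

I = np.full(10_000, 10.0)             # constant input current, 1 s of activity
vs, spikes = izhikevich(I)
print(len(spikes))                    # tonic spiking under constant drive
```

Its low arithmetic cost (no exponentials, only multiply-adds) is precisely what makes it attractive for FPGA pipelines like the one above.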
PREMIX: PRivacy-preserving EstiMation of Individual admiXture.
Chen, Feng; Dow, Michelle; Ding, Sijie; Lu, Yao; Jiang, Xiaoqian; Tang, Hua; Wang, Shuang
2016-01-01
In this paper we propose a framework, PRivacy-preserving EstiMation of Individual admiXture (PREMIX), using Intel Software Guard Extensions (SGX). SGX is a suite of software and hardware architectures that enable efficient and secure computation over confidential data. PREMIX enables multiple sites to securely collaborate on estimating individual admixture within a secure enclave inside Intel SGX. We implemented a feature selection module to identify the most discriminative Single Nucleotide Polymorphisms (SNPs) based on informativeness, and an Expectation Maximization (EM)-based Maximum Likelihood estimator to identify the individual admixture. Experimental results based on both simulation and 1000 Genomes data demonstrated the efficiency and accuracy of the proposed framework. PREMIX ensures a high level of security, as all operations on sensitive genomic data are conducted within a secure enclave using SGX.
Enhanced autocompensating quantum cryptography system.
Bethune, Donald S; Navarro, Martha; Risk, William P
2002-03-20
We have improved the hardware and software of our autocompensating system for quantum key distribution by replacing bulk optical components at the end stations with fiber-optic equivalents and implementing software that synchronizes end-station activities, communicates basis choices, corrects errors, and performs privacy amplification over a local area network. The all-fiber-optic arrangement provides stable, efficient, and high-contrast routing of the photons. The low-bit error rate leads to high error-correction efficiency and minimizes data sacrifice during privacy amplification. Characterization measurements made on a number of commercial avalanche photodiodes are presented that highlight the need for improved devices tailored specifically for quantum information applications. A scheme for frequency shifting the photons returning from Alice's station to allow them to be distinguished from backscattered noise photons is also described.
Pulse-coupled neural network implementation in FPGA
NASA Astrophysics Data System (ADS)
Waldemark, Joakim T. A.; Lindblad, Thomas; Lindsey, Clark S.; Waldemark, Karina E.; Oberg, Johnny; Millberg, Mikael
1998-03-01
Pulse Coupled Neural Networks (PCNN) are biologically inspired neural networks, mainly based on studies of the visual cortex of small mammals. The PCNN is very well suited as a pre-processor for image processing, particularly in connection with object isolation, edge detection and segmentation. Several implementations of PCNN on von Neumann computers, as well as on special parallel processing hardware devices (e.g. SIMD), exist. However, these implementations are not as flexible as required for many applications. Here we present an implementation in Field Programmable Gate Arrays (FPGA) together with a performance analysis. The FPGA hardware implementation may be considered a platform for further, extended implementations and easily expanded into various applications. The latter may include advanced on-line image analysis with close to real-time performance.
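For readers new to PCNN, one standard formulation of the per-iteration update that such implementations pipeline is sketched below (the linking kernel and decay constants are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import convolve

def pcnn(S, steps=10, beta=0.2, aF=0.1, aL=1.0, aT=0.5, VF=0.1, VL=0.2, VT=20.0):
    """Standard PCNN iteration: feeding F and linking L decay and gather
    neighborhood pulses Y; the internal activity U = F (1 + beta L) is
    compared against a dynamic threshold T that jumps after each firing."""
    K = np.array([[0.5, 1, 0.5], [1, 0, 1], [0.5, 1, 0.5]])  # linking kernel
    F = np.zeros_like(S); L = np.zeros_like(S)
    Y = np.zeros_like(S); T = np.ones_like(S)
    for _ in range(steps):
        W = convolve(Y, K, mode="constant")
        F = np.exp(-aF) * F + VF * W + S    # feeding: stimulus plus neighbors
        L = np.exp(-aL) * L + VL * W        # linking: neighbors only
        U = F * (1.0 + beta * L)
        Y = (U > T).astype(float)           # binary pulse output per neuron
        T = np.exp(-aT) * T + VT * Y        # threshold recovery and jump
    return Y

img = np.random.default_rng(0).random((32, 32))
print(pcnn(img).sum())                       # pixels pulsing at the final step
```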
Eastman, Peter; Friedrichs, Mark S; Chodera, John D; Radmer, Randall J; Bruns, Christopher M; Ku, Joy P; Beauchamp, Kyle A; Lane, Thomas J; Wang, Lee-Ping; Shukla, Diwakar; Tye, Tony; Houston, Mike; Stich, Timo; Klein, Christoph; Shirts, Michael R; Pande, Vijay S
2013-01-08
OpenMM is a software toolkit for performing molecular simulations on a range of high performance computing architectures. It is based on a layered architecture: the lower layers function as a reusable library that can be invoked by any application, while the upper layers form a complete environment for running molecular simulations. The library API hides all hardware-specific dependencies and optimizations from the users and developers of simulation programs: they can be run without modification on any hardware on which the API has been implemented. The current implementations of OpenMM include support for graphics processing units using the OpenCL and CUDA frameworks. In addition, OpenMM was designed to be extensible, so new hardware architectures can be accommodated and new functionality (e.g., energy terms and integrators) can be easily added. PMID:23316124
Parallel ICA and its hardware implementation in hyperspectral image analysis
NASA Astrophysics Data System (ADS)
Du, Hongtao; Qi, Hairong; Peterson, Gregory D.
2004-04-01
Advances in hyperspectral imaging have dramatically boosted remote sensing applications by providing abundant information across hundreds of contiguous spectral bands. However, the high volume of information also results in an excessive computation burden. Since most materials have specific characteristics only at certain bands, much of this information is redundant. This property of hyperspectral images has motivated many researchers to study various dimensionality reduction algorithms, including Projection Pursuit (PP), Principal Component Analysis (PCA), wavelet transform, and Independent Component Analysis (ICA), of which ICA is one of the most popular techniques. It searches for a linear or nonlinear transformation which minimizes the statistical dependence between spectral bands. Through this process, ICA can eliminate superfluous information while retaining practical information, given only the observations of hyperspectral images. One hurdle of applying ICA in hyperspectral image (HSI) analysis, however, is its long computation time, especially for high volume hyperspectral data sets. Even the most efficient method, FastICA, is very time-consuming. In this paper, we present a parallel ICA (pICA) algorithm derived from FastICA. During the unmixing process, pICA divides the estimation of the weight matrix into sub-processes which can be conducted in parallel on multiple processors. The decorrelation process is decomposed into internal decorrelation and external decorrelation, which perform weight vector decorrelations within individual processors and between cooperative processors, respectively. In order to further improve the performance of pICA, we seek hardware solutions for the implementation of pICA. Until now, there have been very few hardware designs for ICA-related processes, due to the complicated and iterative computation. This paper discusses the capacity limitations of FPGA implementations of pICA in HSI analysis. An Application-Specific Integrated Circuit (ASIC) synthesis is designed for pICA-based dimensionality reduction in HSI analysis. The pICA design is implemented using standard-height cells and aimed at a TSMC 0.18 micron process. During the synthesis procedure, three ICA-related reconfigurable components are developed for reuse and retargeting purposes. Preliminary results show that the standard-height-cell-based ASIC synthesis provides an effective solution for pICA and ICA-related processes in HSI analysis.
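A rough Python sketch of the internal/external decorrelation split described above, using Gram-Schmidt style deflation; this is illustrative only, and the paper's pICA details differ:

```python
import numpy as np

def decorrelate(w, basis):
    """Remove projections of w onto already-fixed unit vectors, renormalize."""
    for b in basis:
        w = w - (w @ b) * b
    return w / np.linalg.norm(w)

# two "processors", each estimating two 8-D weight vectors
rng = np.random.default_rng(0)
local = [[rng.standard_normal(8) for _ in range(2)] for _ in range(2)]

fixed = []                            # vectors already finalized globally
for vectors in local:
    internal = []
    for w in vectors:
        w = decorrelate(w, fixed)     # external decorrelation (other processors)
        w = decorrelate(w, internal)  # internal decorrelation (this processor)
        internal.append(w)
    fixed.extend(internal)            # publish finished vectors to the pool
```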
Enhancing quantum annealing performance for the molecular similarity problem
NASA Astrophysics Data System (ADS)
Hernandez, Maritza; Aramon, Maliheh
2017-05-01
Quantum annealing is a promising technique which leverages quantum mechanics to solve hard optimization problems. Considerable progress has been made in the development of a physical quantum annealer, motivating the study of methods to enhance the efficiency of such a solver. In this work, we present a quantum annealing approach to measure similarity among molecular structures. Implementing real-world problems on a quantum annealer is challenging due to hardware limitations such as sparse connectivity, intrinsic control error, and limited precision. In order to overcome the limited connectivity, a problem must be reformulated using minor-embedding techniques. Using a real data set, we investigate the performance of a quantum annealer in solving the molecular similarity problem. We provide experimental evidence that common practices for embedding can be replaced by new alternatives which mitigate some of the hardware limitations and enhance its performance. Common practices for embedding include minimizing either the number of qubits or the chain length and determining the strength of ferromagnetic couplers empirically. We show that current criteria for selecting an embedding do not improve the hardware's performance for the molecular similarity problem. Furthermore, we use a theoretical approach to determine the strength of ferromagnetic couplers. Such an approach removes the computational burden of the current empirical approaches and also results in hardware solutions that can benefit from simple local classical improvement. Although our results are limited to the problems considered here, they can be generalized to guide future benchmarking studies.
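To illustrate the embedding step the abstract studies, a minimal Python sketch of how a logical variable represented by a chain of physical qubits gets a ferromagnetic chain coupler so its members agree; the chain strength here is set by hand, whereas the paper derives it theoretically:

```python
import itertools

# logical Ising problem: field on variable 0, coupling between 0 and 1
chain = {0: [0, 2]}                 # variable 0 -> physical qubits 0 and 2
J_chain = 2.0                       # hand-picked ferromagnetic chain strength

h_phys = {0: 0.5 / 2, 2: 0.5 / 2, 1: 0.0}   # logical field spread over chain
J_phys = {(0, 1): -1.0, (0, 2): -J_chain}   # logical bond + chain bond

def energy(s):
    return (sum(h_phys[i] * s[i] for i in h_phys)
            + sum(Jij * s[i] * s[j] for (i, j), Jij in J_phys.items()))

best = min(itertools.product([-1, 1], repeat=3), key=energy)
print(best)   # chain qubits 0 and 2 end up aligned in the ground state
```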
Ka-Band Wide-Bandgap Solid-State Power Amplifier: Hardware Validation
NASA Technical Reports Server (NTRS)
Epp, L.; Khan, P.; Silva, A.
2005-01-01
Motivated by recent advances in wide-bandgap (WBG) gallium nitride (GaN) semiconductor technology, there is considerable interest in developing efficient solid-state power amplifiers (SSPAs) as an alternative to the traveling-wave tube amplifier (TWTA) for space applications. This article documents proof-of-concept hardware used to validate power-combining technologies that may enable a 120-W, 40 percent power-added efficiency (PAE) SSPA. Results in previous articles [1-3] indicate that architectures based on at least three power combiner designs are likely to enable the target SSPA. Previous architecture performance analyses and estimates indicate that the proposed architectures can power combine 16 to 32 individual monolithic microwave integrated circuits (MMICs) with >80 percent combining efficiency. This combining efficiency would correspond to MMIC requirements of 5- to 10-W output power and >48 percent PAE. In order to validate the performance estimates of the three proposed architectures, measurements of proof-of-concept hardware are reported here.
NASA Technical Reports Server (NTRS)
Allard, R.; Mack, B.; Bayoumi, M. M.
1989-01-01
Most robot systems lack a suitable hardware and software environment for the efficient research of new control and sensing schemes. Typically, engineers and researchers need to be experts in control, sensing, programming, communication, and robotics in order to implement, integrate, and test new ideas in a robot system. To reduce this development time, the Robot Controller Test Station (RCTS) has been developed. It uses a modular hardware and software architecture allowing easy physical and functional reconfiguration of a robot. This is accomplished by emphasizing four major design goals: flexibility, portability, ease of use, and ease of modification. An enhanced distributed processing version of RCTS is described. It features an expanded and more flexible communication system design. Distributed processing results in the availability of more local computing power and retains the low cost of microprocessors. A large number of possible communication, control, and sensing schemes can therefore be easily introduced and tested, using the same basic software structure.
Paraskevopoulou, Sivylla E; Wu, Di; Eftekhar, Amir; Constandinou, Timothy G
2014-09-30
This work presents a novel unsupervised algorithm for real-time adaptive clustering of neural spike data (spike sorting). The proposed Hierarchical Adaptive Means (HAM) clustering method combines centroid-based clustering with hierarchical cluster connectivity to classify incoming spikes using groups of clusters. It is described how the proposed method can adaptively track the incoming spike data without requiring any past history, iteration, or training, and autonomously determines the number of spike classes. Its performance (classification accuracy) has been tested using multiple datasets (both simulated and recorded), achieving near-identical accuracy compared to k-means (using 10 iterations and provided with the number of spike classes). Its robustness across different feature extraction methods has also been demonstrated by achieving classification accuracies above 80% across multiple datasets. Finally, and crucially, its low complexity, quantified through both memory and computation requirements, makes this method highly attractive for future hardware implementation. Copyright © 2014 Elsevier B.V. All rights reserved.
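A condensed Python sketch of the adaptive, training-free clustering idea: assign each incoming spike feature vector to its nearest centroid if close enough, otherwise open a new cluster. The real HAM method adds hierarchical cluster connectivity; the distance threshold tau below is a made-up parameter:

```python
import numpy as np

def ham_like_assign(x, centroids, counts, tau=2.0):
    if centroids:
        d = [np.linalg.norm(x - c) for c in centroids]
        k = int(np.argmin(d))
        if d[k] < tau:
            counts[k] += 1
            centroids[k] += (x - centroids[k]) / counts[k]  # running mean
            return k
    centroids.append(x.astype(float).copy()); counts.append(1)
    return len(centroids) - 1      # a new spike class was opened

centroids, counts = [], []
for spike in np.random.randn(100, 3):    # toy 3-D spike features
    ham_like_assign(spike, centroids, counts)
print(len(centroids), "clusters found")
```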
Toward Evolvable Hardware Chips: Experiments with a Programmable Transistor Array
NASA Technical Reports Server (NTRS)
Stoica, Adrian
1998-01-01
Evolvable hardware is reconfigurable hardware that self-configures under the control of an evolutionary algorithm. The search for a hardware configuration can be performed using software models or, faster and more accurately, directly in reconfigurable hardware. Several experiments have demonstrated the possibility of automatically synthesizing both digital and analog circuits. The paper introduces an approach to the automated synthesis of CMOS circuits based on evolution on a Programmable Transistor Array (PTA). The approach is illustrated with a software experiment showing the evolutionary synthesis of a circuit with a desired DC characteristic. A hardware implementation of a test PTA chip is then described, and the same evolutionary experiment is performed on the chip, demonstrating circuit synthesis/self-configuration directly in hardware.
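A bare-bones Python sketch of the evolutionary loop used for such searches: a bitstring genome (switch states) is scored and evolved by mutation. The fitness function here is a dummy stand-in for the DC-characteristic error evolved against in the paper:

```python
import random

GENOME_BITS = 24                          # hypothetical switch count

def fitness(genome):
    # stand-in objective: match a target switch pattern
    target = [i % 2 for i in range(GENOME_BITS)]
    return -sum(g != t for g, t in zip(genome, target))

def evolve(pop_size=20, generations=100, p_mut=0.05):
    pop = [[random.randint(0, 1) for _ in range(GENOME_BITS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                 # truncation selection
        children = [[b ^ (random.random() < p_mut)     # bit-flip mutation
                     for b in random.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

print(fitness(evolve()))    # approaches 0 as the pattern is matched
```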
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Youngblood, John N.; Saha, Aindam
1987-01-01
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost parallel systems to increase system performance. Research conducted in the development of a specialized computer architecture for the real-time algorithmic execution of an avionics guidance and control problem is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on critical path analysis. The final stage is the design and development of hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, task definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
Siddiqui, M F; Reza, A W; Kanesan, J; Ramiah, H
2014-01-01
A wide interest has been observed in finding a low-power and area-efficient hardware design for the discrete cosine transform (DCT) algorithm. This research work proposes a novel Common Subexpression Elimination (CSE)-based pipelined architecture for the DCT, aimed at reducing the cost metrics of power and area while maintaining high speed and accuracy in DCT applications. The proposed design combines the techniques of Canonical Signed Digit (CSD) representation and CSE to implement a multiplier-less method for the fixed-constant multiplication of DCT coefficients. Furthermore, symmetry in the DCT coefficient matrix is exploited with CSE to further decrease the number of arithmetic operations. This architecture needs a single-port memory to feed the inputs instead of multiport memory, which leads to a reduction of the hardware cost and area. From the analysis of experimental results and performance comparisons, it is observed that the proposed scheme uses minimal logic, utilizing a mere 340 slices and 22 adders. Moreover, this design meets the real-time constraints of different video/image coders and peak signal-to-noise ratio (PSNR) requirements. Furthermore, the proposed technique has significant advantages over recent well-known methods in terms of power reduction, silicon area usage, and maximum operating frequency, by 41%, 15%, and 15%, respectively, along with gains in accuracy. PMID:25133249
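A small Python illustration of the multiplier-less idea behind CSD: a fixed coefficient is encoded with digits in {-1, 0, +1}, so multiplying by it reduces to shifts and adds/subtracts, the only operations the pipelined hardware needs. The coefficient value below is an invented example:

```python
def to_csd(value, bits=12):
    """Return (shift, sign) pairs of the canonical signed digit form."""
    digits = []
    k = 0
    while value != 0 and k < bits:
        if value & 1:
            # choose +1 or -1 so the remaining value stays divisible by 4
            d = -1 if (value & 3) == 3 else 1
            digits.append((k, d))
            value -= d
        value >>= 1
        k +=  1
    return digits

def csd_multiply(x, digits):
    """Multiply x by the encoded constant using only shifts and adds."""
    return sum(d * (x << s) for s, d in digits)

C = 1448                          # ~0.7071 * 2^11, a scaled DCT-like coefficient
digits = to_csd(C)                # 1448 = 2^3 + 2^5 - 2^7 - 2^9 + 2^11
assert csd_multiply(123, digits) == 123 * C
```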
NASA Technical Reports Server (NTRS)
Roman, Monsi C.; Perry, Jay L.; Howard, David F.
2014-01-01
The Advanced Exploration Systems (AES) Program's Atmosphere Resource Recovery and Environmental Monitoring (ARREM) Project has been developing atmosphere revitalization and environmental monitoring subsystem architectures suitable for enabling sustained crewed exploration missions beyond low Earth orbit (LEO). Using the International Space Station state-of-the-art (SOA) as the technical basis, the ARREM Project has contributed to technical advances that improve affordability, reliability, and functional efficiency while reducing dependence on a ground-based logistics resupply model. Functional demonstrations have merged new process technologies and concepts with existing ISS developmental hardware and operated them in a controlled environment simulating various crew metabolic loads. The ARREM Project's strengths include access to a full complement of existing developmental hardware that performs all the core atmosphere revitalization functions, unique testing facilities to evaluate subsystem performance, and a coordinated partnering effort among six NASA field centers and industry partners to provide the innovative expertise necessary to succeed. A project overview is provided, and the project management strategies that have enabled a multidisciplinary engineering team to work efficiently across project, NASA field center, and industry boundaries to achieve the project's technical goals are discussed. Lessons learned and best practices relating to the project are presented and discussed.
Hardware-software face detection system based on multi-block local binary patterns
NASA Astrophysics Data System (ADS)
Acasandrei, Laurentiu; Barriga, Angel
2015-03-01
Face detection is an important aspect of biometrics, video surveillance, and human-computer interaction. Due to the complexity of the detection algorithms, any face detection system requires a huge amount of computational and memory resources. In this communication, an accelerated implementation of the MB-LBP face detection algorithm targeting low-frequency, low-memory, and low-power embedded systems is presented. The resulting implementation is time-deterministic and uses a customizable AMBA IP hardware accelerator. The IP implements the kernel operations of the MB-LBP algorithm and can be used as a universal accelerator for MB-LBP-based applications. The IP employs 8 parallel MB-LBP feature evaluator cores, uses a deterministic bandwidth, has a low area profile, and consumes ~95 mW on a Virtex5 XC5VLX50T. The resulting implementation's acceleration gain is between 5 and 8 times, while the hardware MB-LBP feature evaluation gain is between 69 and 139 times.
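The multi-block LBP feature itself is compact enough to sketch in a few lines of Python: the average intensities of a 3x3 grid of blocks are compared with the center block's average to form an 8-bit code. Block sizes and the test image below are arbitrary:

```python
import numpy as np

def mb_lbp(img, x, y, bw, bh):
    """MB-LBP code for a 3x3 block grid whose top-left corner is (x, y)."""
    means = [img[y + r * bh: y + (r + 1) * bh,
                 x + c * bw: x + (c + 1) * bw].mean()
             for r in range(3) for c in range(3)]
    center = means[4]
    neighbours = [means[i] for i in (0, 1, 2, 5, 8, 7, 6, 3)]  # clockwise order
    return sum(int(m >= center) << i for i, m in enumerate(neighbours))

img = np.random.randint(0, 256, (64, 64)).astype(float)
print(mb_lbp(img, 8, 8, 4, 4))     # an 8-bit feature code in [0, 255]
```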
Chao, Chun-Tang
2014-01-01
This paper presents the design and evaluation of a hardware circuit for electronic stethoscopes with heart sound cancellation capabilities using field programmable gate arrays (FPGAs). The adaptive line enhancer (ALE) was adopted as the filtering methodology to reduce heart sound attributes in the breath sounds obtained via the electronic stethoscope pickup. FPGAs were utilized to implement the ALE functions in hardware to achieve near real-time breath sound processing. We believe that such an implementation is unprecedented and crucial toward a truly useful, standalone medical device for outpatient clinic settings. The implementation evaluation with an Altera Cyclone II EP2C70F89 shows that the proposed ALE used 45% of the chip's resources. Experiments with the proposed prototype were made using a DE2-70 emulation board with recorded body signals obtained from online medical archives. Clear suppression was observed in our experiments from both frequency-domain and time-domain perspectives. PMID:24790573
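The ALE is essentially a self-referenced LMS predictor: a filter predicts the input from a delayed copy of itself, so quasi-periodic heart sounds land in the prediction while broadband breath sounds remain in the error signal. A rough Python sketch, with delay, tap count, and step size as illustrative values rather than the paper's:

```python
import numpy as np

def ale(x, delay=32, taps=32, mu=0.01):
    w = np.zeros(taps)
    predicted = np.zeros_like(x)       # quasi-periodic (heart-sound) estimate
    for n in range(delay + taps, len(x)):
        u = x[n - delay - taps:n - delay][::-1]   # delayed reference window
        yhat = w @ u
        e = x[n] - yhat                # breath-sound estimate
        w += 2 * mu * e * u            # LMS update
        predicted[n] = yhat
    return predicted, x - predicted

fs = 4000
t = np.arange(fs * 2) / fs
heart = np.sin(2 * np.pi * 60 * t)            # toy quasi-periodic component
breath = 0.5 * np.random.randn(t.size)        # toy broadband component
periodic, residual = ale(heart + breath)      # residual retains breath sounds
```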
The implementation and use of Ada on distributed systems with high reliability requirements
NASA Technical Reports Server (NTRS)
Knight, J. C.; Gregory, S. T.; Urquhart, J. I. A.
1984-01-01
The use and implementation of Ada (a trademark of the US Dept. of Defense) in distributed environments in which the hardware is assumed to be unreliable were investigated. In particular, the possibility of programming a distributed system entirely in Ada, so that the individual tasks of the system are unconcerned with which processors they are executing on and with failures occurring in the underlying hardware, was examined.
NASA Astrophysics Data System (ADS)
Mohan, Dhanya; Kumar, C. Santhosh
2016-03-01
Predicting the physiological condition (normal/abnormal) of a patient is highly desirable for enhancing the quality of health care. Multi-parameter patient monitors (MPMs) using heart rate, arterial blood pressure, respiration rate, and oxygen saturation (SpO2) as input parameters were developed to monitor the condition of patients with minimum human resource utilization. The support vector machine (SVM), an advanced machine learning approach popularly used for classification and regression, is used for the realization of MPMs. To make MPMs cost effective, we experiment with a hardware implementation of the MPM using a support vector machine classifier. The training of the system is done in the MATLAB environment, and the detection of the alarm/no-alarm condition is implemented in hardware. We used different kernels for SVM classification and note that the best performance was obtained using the intersection kernel SVM (IKSVM). The intersection kernel support vector machine classifier MPM outperformed the best known MPM using a radial basis function kernel by an absolute improvement of 2.74% in accuracy, 1.86% in sensitivity, and 3.01% in specificity. The hardware model was developed based on the improved performance system using the Verilog Hardware Description Language and was implemented on an Altera Cyclone II development board.
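For reference, the intersection kernel credited above is K(x, y) = sum_i min(x_i, y_i). A brief Python sketch using scikit-learn's precomputed-kernel SVC on toy vital-sign vectors; the feature layout and data are illustrative, not the paper's:

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(X, Y):
    # pairwise histogram-intersection similarity
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 4))    # HR, ABP, resp rate, SpO2 (scaled)
y = (X.sum(axis=1) > 2).astype(int)     # toy alarm/no-alarm label

clf = SVC(kernel="precomputed").fit(intersection_kernel(X, X), y)
preds = clf.predict(intersection_kernel(X, X))
print("training accuracy:", (preds == y).mean())
```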
The digital compensation technology system for automotive pressure sensor
NASA Astrophysics Data System (ADS)
Guo, Bin; Li, Quanling; Lu, Yi; Luo, Zai
2011-05-01
Piezoresistive pressure sensors, made from semiconductor silicon and based on the piezoresistive effect, have many desirable characteristics. However, because of the temperature sensitivity of semiconductors, the performance of a silicon sensor also changes with temperature, and a pressure sensor free of temperature drift cannot be produced at present. This paper briefly describes the principles of these sensors, the function of the pressure sensor, and the various types of compensation methods, and presents a detailed digital compensation scheme for an automotive pressure sensor. Analog-digital mixed-signal conditioning is adopted, using the signal-conditioning chip MAX1452, the AVR microcontroller ATmega128, and other devices. The design comprises the digital pressure sensor hardware circuit and the microcontroller hardware circuit, together with the microcontroller software: the sensor hardware circuit implements the correction and compensation of the sensor, the microcontroller hardware circuit controls that correction and compensation, and the microcontroller software carries out the compensation arithmetic. Finally, the sensor output is measured and compared with uncompensated data. The results indicate that the compensated sensor output is clearly more accurate than the uncompensated output, improving both the compensation precision and the stability of the pressure sensor.
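As a generic illustration of such digital compensation, a short Python sketch that fits a two-variable polynomial correction, mapping (raw reading, temperature) to true pressure, from calibration data and then evaluates it at run time; the drift model and calibration points are invented:

```python
import numpy as np

# hypothetical calibration data: raw reading r at known pressure P and temp T
rng = np.random.default_rng(3)
P = rng.uniform(10, 100, 50); T = rng.uniform(-20, 80, 50)
r = P * (1 + 0.002 * T) + 0.05 * T          # made-up temperature drift model

def design(r, T):
    # polynomial basis in raw reading and temperature
    return np.column_stack([np.ones_like(r), r, T, r * T, T**2, r * T**2])

coef, *_ = np.linalg.lstsq(design(r, T), P, rcond=None)   # calibration fit

def compensate(r, T):
    """Run-time correction, as a microcontroller would evaluate it."""
    return design(np.atleast_1d(float(r)), np.atleast_1d(float(T))) @ coef

print(compensate(55.0, 25.0))    # corrected pressure estimate
```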
Superconducting Optoelectronic Circuits for Neuromorphic Computing
NASA Astrophysics Data System (ADS)
Shainline, Jeffrey M.; Buckley, Sonia M.; Mirin, Richard P.; Nam, Sae Woo
2017-03-01
Neural networks have proven effective for solving many difficult computational problems, yet implementing complex neural networks in software is computationally expensive. To explore the limits of information processing, it is necessary to implement new hardware platforms with large numbers of neurons, each with a large number of connections to other neurons. Here we propose a hybrid semiconductor-superconductor hardware platform for the implementation of neural networks and large-scale neuromorphic computing. The platform combines semiconducting few-photon light-emitting diodes with superconducting-nanowire single-photon detectors to behave as spiking neurons. These processing units are connected via a network of optical waveguides, and variable weights of connection can be implemented using several approaches. The use of light as a signaling mechanism overcomes fanout and parasitic constraints on electrical signals while simultaneously introducing physical degrees of freedom which can be employed for computation. The use of supercurrents achieves the low power density (1 mW/cm^2 at a 20-MHz firing rate) necessary to scale to systems with enormous entropy. Estimates comparing the proposed hardware platform to a human brain show that with the same number of neurons (10^11) and 700 independent connections per neuron, the hardware presented here may achieve an order of magnitude improvement in synaptic events per second per watt.
Lu, Zhaojun; Li, Dongfang; Liu, Hailong; Gong, Mingyang; Liu, Zhenglin
2017-01-01
Wireless sensor networks (WSNs) are an emerging technology employed in some crucial applications. However, limited resources and physical exposure to attackers make security a challenging issue for a WSN. Ring oscillator-based physical unclonable function (RO PUF) is a potential option to protect the security of sensor nodes because it is able to generate random responses efficiently for a key extraction mechanism, which prevents the non-volatile memory from storing secret keys. In order to deploy RO PUF in a WSN, hardware efficiency, randomness, uniqueness, and reliability should be taken into account. Besides, the resistance to electromagnetic (EM) analysis attack is important to guarantee the security of RO PUF itself. In this paper, we propose a novel architecture of configurable RO PUF based on exclusive-or (XOR) gates. First, it dramatically increases the hardware efficiency compared with other types of RO PUFs. Second, it mitigates the vulnerability to EM analysis attack by placing the adjacent RO arrays in accordance with the cosine wave and sine wave so that the frequency of each RO cannot be detected. We implement our proposal in Xilinx Artix-7 field programmable gate arrays (FPGAs) and conduct a set of experiments to evaluate the quality of the responses. The results show that responses pass the National Institute of Standards and Technology (NIST) statistical test and have good uniqueness and reliability under different environments. Therefore, the proposed configurable RO PUF is suitable to establish a key extraction mechanism in a WSN. PMID:28914756
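A toy Python model of how any RO PUF derives a response bit: a challenge selects a pair of ring oscillators and the bit records which one is faster. The frequencies below are random stand-ins for per-chip manufacturing variation; the XOR-based configurable array of the paper is not modeled:

```python
import numpy as np

rng = np.random.default_rng(42)
nominal = 200e6                                    # nominal RO frequency, Hz
chip_freqs = nominal * (1 + 0.01 * rng.standard_normal(64))

def ro_puf_response(challenge_pairs, freqs):
    """One response bit per (i, j) pair: 1 if RO_i is faster than RO_j."""
    return [int(freqs[i] > freqs[j]) for i, j in challenge_pairs]

pairs = [(i, i + 32) for i in range(32)]           # a 32-bit challenge
key_bits = ro_puf_response(pairs, chip_freqs)      # feeds key extraction
print("".join(map(str, key_bits)))
```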
Towards Batched Linear Solvers on Accelerated Hardware Platforms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Haidar, Azzam; Dong, Tingzing Tim; Tomov, Stanimire
2015-01-01
As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations, LU, QR, and Cholesky, that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of the GPU's symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with the profiling and tracing tools, guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to a batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU.
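The batched idea in miniature, sketched in Python: factor thousands of small, independent matrices in one vectorized call rather than one at a time. NumPy's stacked linear algebra plays the role the batched BLAS/GPU kernels play in the paper; the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4096, 8, 8))
spd = A @ A.transpose(0, 2, 1) + 8 * np.eye(8)   # 4096 small SPD matrices

L = np.linalg.cholesky(spd)                      # one batched factorization
assert np.allclose(L @ L.transpose(0, 2, 1), spd)
```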
NASA Technical Reports Server (NTRS)
Dugala, Gina M.
2009-01-01
The U.S. Department of Energy (DOE), Lockheed Martin Space Company (LMSC), Sunpower Inc., and NASA Glenn Research Center (GRC) have been developing an Advanced Stirling Radioisotope Generator (ASRG) for use as a power system on space science missions. This generator will make use of free-piston Stirling convertors to achieve higher conversion efficiency than currently available alternatives. NASA GRC's support of ASRG development includes extended operation testing of Advanced Stirling Convertors (ASCs) developed by Sunpower Inc. In the past year, NASA GRC has been building a test facility to support extended operation of a pair of engineering-level ASCs. Operation of the convertors in the test facility provides convertor performance data over an extended period of time. Mechanical support hardware, data acquisition software, and an instrumentation rack were developed to prepare the pair of convertors for continuous extended operation. Short-term tests were performed to gather baseline performance data before extended operation was initiated. These tests included workmanship vibration, insulation thermal loss characterization, low-temperature checkout, and full-power operation. Hardware and software features are implemented to ensure the reliability of support systems. This paper discusses the mechanical support hardware, instrumentation rack, data acquisition software, short-term tests, and safety features designed to support continuous unattended operation of a pair of ASCs.
Hardware for dynamic quantum computing experiments: Part I
NASA Astrophysics Data System (ADS)
Johnson, Blake; Ryan, Colm; Riste, Diego; Donovan, Brian; Ohki, Thomas
Static, pre-defined control sequences routinely achieve high-fidelity operation on superconducting quantum processors. Efforts toward dynamic experiments depending on real-time information have mostly proceeded through hardware duplication and triggers, leading to a combinatorial explosion in the number of channels. We provide a hardware-efficient solution to dynamic control with a complete platform of specialized FPGA-based control and readout electronics; these components enable arbitrary control flow, low-latency feedback and/or feedforward, and scale far beyond single-qubit control and measurement. We will introduce the BBN Arbitrary Pulse Sequencer 2 (APS2) control system and the X6 QDSP readout platform. The BBN APS2 features: a sequencer built around implementing short quantum gates, a sequence cache to allow long sequences with branching structures, subroutines for code re-use, and a trigger distribution module to capture and distribute steering information. The X6 QDSP features a single-stage DSP pipeline that combines demodulation with arbitrary integration kernels, and multiple taps to inspect data flow for debugging and calibration. We will show system performance when putting it all together, including a latency budget for feedforward operations. This research was funded by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Office Contract No. W911NF-10-1-0324.
Deploying a Route Optimization EFB Application for Commercial Airline Operational Trials
NASA Technical Reports Server (NTRS)
Roscoe, David A.; Vivona, Robert A.; Woods, Sharon E.; Karr, David A.; Wing, David J.
2016-01-01
The Traffic Aware Planner (TAP), developed for NASA Langley Research Center to support the Traffic Aware Strategic Aircrew Requests (TASAR) project, is a flight-efficiency software application developed for an Electronic Flight Bag (EFB). Tested in two flight trials and planned for operational testing by two commercial airlines, TAP is a real-time trajectory optimization application that leverages connectivity with onboard avionics and broadband Internet sources to compute and recommend route modifications to flight crews to improve fuel and time performance. The application utilizes a wide range of data, including Automatic Dependent Surveillance Broadcast (ADS-B) traffic, Flight Management System (FMS) guidance and intent, on-board sensors, published winds and weather, and Special Use Airspace (SUA) schedules. This paper discusses the challenges of developing and deploying TAP to various EFB platforms, our solutions to some of these challenges, and lessons learned, to assist commercial software developers and hardware manufacturers in their efforts to implement and extend TAP functionality in their environments. EFB applications (such as TAP) typically access avionics data via an ARINC 834 Simple Text Avionics Protocol (STAP) server hosted by an Aircraft Interface Device (AID) or other installed hardware. While the protocol is standardized, the data sources, content, and transmission rates can vary from aircraft to aircraft. Additionally, the method of communicating with the AID may vary depending on EFB hardware and/or the availability of onboard networking services, such as Ethernet, WIFI, Bluetooth, or other mechanisms. EFBs with portable and installed components can be implemented using a variety of operating systems, and cockpits are increasingly incorporating tablet-based technologies, further expanding the number of platforms the application may need to support. Supporting multiple EFB platforms, AIDs, avionics datasets, and user interfaces presents a challenge for software developers and the management of their code baselines. Maintaining multiple baselines to support all deployment targets can be extremely cumbersome and expensive. Certification also needs to be considered when developing the application. Regardless of whether the software is itself destined to be certified, data requirements in support of the application and user interface elements may introduce certification requirements for EFB manufacturers and the airlines. The example of TAP, the challenges faced, solutions implemented, and lessons learned will give EFB application and hardware developers insight into future potential requirements in deploying TAP or similar flight-deck EFB applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Warkentin, H; Bubric, K; Giovannetti, H
2016-06-15
Purpose: As a quality improvement measure, we undertook this work to incorporate usability testing into the implementation procedures for new electronic documents and forms used by four affiliated radiation therapy centers. Methods: A human factors specialist provided training in usability testing for a team of medical physicists, radiation therapists, and radiation oncologists from four radiotherapy centers. A usability testing plan was then developed that included controlled scenarios and standardized forms for qualitative and quantitative feedback from participants, including patients. Usability tests were performed by end users using the same hardware and viewing conditions that are found in the clinical environment. A pilot test of a form used during radiotherapy CT simulation was performed in a single department; feedback informed adaptive improvements to the electronic form, hardware requirements, resource accessibility and the usability testing plan. Following refinements to the testing plan, usability testing was performed at three affiliated cancer centers with different vault layouts and hardware. Results: Feedback from the testing resulted in the detection of 6 critical errors (omissions and inability to complete task without assistance), 6 non-critical errors (recoverable), and multiple suggestions for improvement. Usability problems with room layout were detected at one center and problems with hardware were detected at one center. Upon amalgamation and summary of the results, three key recommendations were presented to the document's authors for incorporation into the electronic form. Documented inefficiencies and patient safety concerns related to the room layout and hardware were presented to administration along with a request for funding to purchase upgraded hardware and accessories to allow a more efficient workflow within the simulator vault. Conclusion: By including usability testing as part of the process when introducing any new document or procedure into clinical use, associated risks can be identified and mitigated before patient care and clinical workflow are impacted.
GPU Lossless Hyperspectral Data Compression System for Space Applications
NASA Technical Reports Server (NTRS)
Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled
2012-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times over a software implementation running on a 3.47 GHz single-core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will in the future provide a fast and practical real-time solution for airborne and space applications.
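A toy Python version of the adaptive-prediction idea behind such compressors: each sample is predicted from the co-located sample in the previous spectral band with an adaptively updated weight (sign-sign LMS here), and only the small residuals would be entropy-coded. The real FL algorithm uses a more elaborate local-mean-subtracted predictor and coder:

```python
import numpy as np

def predict_residuals(cube, mu=2**-6):
    bands, h, w = cube.shape
    resid = np.empty_like(cube)
    resid[0] = cube[0]                       # first band sent as-is
    for z in range(1, bands):
        wgt = 1.0                            # per-band adaptive weight
        for i in range(h * w):
            r, c = divmod(i, w)
            pred = wgt * cube[z - 1, r, c]
            e = cube[z, r, c] - pred
            resid[z, r, c] = e
            wgt += mu * np.sign(e) * np.sign(cube[z - 1, r, c])  # sign-sign LMS
    return resid

base = np.random.rand(16, 16)
cube = np.stack([base + 0.05 * np.random.rand(16, 16) for _ in range(4)])
res = predict_residuals(cube)
print(np.abs(res[1:]).mean(), "vs raw", np.abs(cube[1:]).mean())
```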
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness
NASA Astrophysics Data System (ADS)
Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.
2014-09-01
Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
NASA Technical Reports Server (NTRS)
Steele, John W.; Rector, Tony; Gazda, Daniel; Lewis, John
2011-01-01
An EMU water processing kit (Airlock Coolant Loop Recovery -- A/L CLR) was developed as a corrective action to Extravehicular Mobility Unit (EMU) coolant flow disruptions experienced on the International Space Station (ISS) in May of 2004 and thereafter. A conservative duty cycle and set of use parameters for A/L CLR use and component life were initially developed and implemented based on prior analysis results and analytical modeling. Several initiatives were undertaken to optimize the duty cycle and use parameters of the hardware. Examination of post-flight samples and EMU Coolant Loop hardware provided invaluable information on the performance of the A/L CLR and has allowed for an optimization of the process. The intent of this paper is to detail the evolution of the A/L CLR hardware, efforts to optimize the duty cycle and use parameters, and the final recommendations for implementation in the post-Shuttle retirement era.
NASA Technical Reports Server (NTRS)
Cole, Richard
1991-01-01
The major goals of this effort are as follows: (1) to examine technology insertion options to optimize Advanced Information Processing System (AIPS) performance in the Advanced Launch System (ALS) environment; (2) to examine the AIPS concepts to ensure that valuable new technologies are not excluded from the AIPS/ALS implementations; (3) to examine advanced microprocessors applicable to AIPS/ALS; (4) to examine radiation hardening technologies applicable to AIPS/ALS; (5) to reach conclusions on AIPS hardware building block implementation technologies; and (6) to reach conclusions on appropriate architectural improvements. The hardware building blocks are the Fault-Tolerant Processor, the Input/Output Sequencers (IOS), and the Intercomputer Interface Sequencers (ICIS).
NASA Astrophysics Data System (ADS)
Prezioso, M.; Merrikh-Bayat, F.; Chakrabarti, B.; Strukov, D.
2016-02-01
Artificial neural networks have been receiving increasing attention due to their superior performance in many information processing tasks. Typically, scaling up the size of the network results in better performance and richer functionality. However, large neural networks are challenging to implement in software, and customized hardware is generally required for their practical implementation. In this work, we will discuss our group's recent efforts on the development of such custom hardware circuits, based on hybrid CMOS/memristor circuits, in particular of the CMOL variety. We will start by reviewing the basics of memristive devices and of CMOL circuits. We will then discuss our recent progress towards the demonstration of hybrid circuits, focusing on the experimental and theoretical results for artificial neural networks based on crossbar-integrated metal oxide memristors. We will conclude the presentation with a discussion of the remaining challenges and the most pressing research needs.
Implementing real-time robotic systems using CHIMERA II
NASA Technical Reports Server (NTRS)
Stewart, David B.; Schmitz, Donald E.; Khosla, Pradeep K.
1990-01-01
A description is given of the CHIMERA II programming environment and operating system, which was developed for implementing real-time robotic systems. Sensor-based robotic systems contain both general- and special-purpose hardware, and thus the development of applications tends to be a time-consuming task. The CHIMERA II environment is designed to reduce the development time by providing a convenient software interface between the hardware and the user. CHIMERA II supports flexible hardware configurations which are based on one or more VME-backplanes. All communication across multiple processors is transparent to the user through an extensive set of interprocessor communication primitives. CHIMERA II also provides a high-performance real-time kernel which supports both deadline and highest-priority-first scheduling. The flexibility of CHIMERA II allows hierarchical models for robot control, such as NASREM, to be implemented with minimal programming time and effort.
Implementing the Data Center Energy Productivity Metric in a High Performance Computing Data Center
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sego, Landon H.; Marquez, Andres; Rawson, Andrew
2013-06-30
As data centers proliferate in size and number, the improvement of their energy efficiency and productivity has become an economic and environmental imperative. Making these improvements requires metrics that are robust, interpretable, and practical. We discuss the properties of a number of the proposed metrics of energy efficiency and productivity. In particular, we focus on the Data Center Energy Productivity (DCeP) metric, which is the ratio of useful work produced by the data center to the energy consumed performing that work. We describe our approach for using DCeP as the principal outcome of a designed experiment using a highly instrumented, high-performance computing data center. We found that DCeP was successful in clearly distinguishing different operational states in the data center, thereby validating its utility as a metric for identifying configurations of hardware and software that would improve energy productivity. We also discuss some of the challenges and benefits associated with implementing the DCeP metric, and we examine the efficacy of the metric in making comparisons within a data center and between data centers.
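A trivial worked example of the metric as defined above, in Python; the workload count and energy figure are invented for the arithmetic only:

```python
# DCeP = useful work produced / energy consumed producing that work
jobs_completed = 1200        # "useful work", e.g. completed tasks in a window
energy_kwh = 350.0           # facility energy over the same window
dcep = jobs_completed / energy_kwh
print(f"DCeP = {dcep:.2f} tasks/kWh")   # higher is more productive
```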
NASA Astrophysics Data System (ADS)
Larger, Laurent; Baylón-Fuentes, Antonio; Martinenghi, Romain; Udaltsov, Vladimir S.; Chembo, Yanne K.; Jacquot, Maxime
2017-01-01
Reservoir computing, originally referred to as an echo state network or a liquid state machine, is a brain-inspired paradigm for processing temporal information. It involves learning a "read-out" interpretation for nonlinear transients developed by high-dimensional dynamics when the latter is excited by the information signal to be processed. This novel computational paradigm is derived from recurrent neural network and machine learning techniques. It has recently been implemented in photonic hardware for a dynamical system, which opens the path to ultrafast brain-inspired computing. We report on a novel implementation involving an electro-optic phase-delay dynamics designed with off-the-shelf optoelectronic telecom devices, thus providing the targeted wide bandwidth. Computational efficiency is demonstrated experimentally with speech-recognition tasks. State-of-the-art speed performances reach one million words per second, with a very low word error rate. Additionally, beyond record processing speed, our investigations have revealed computing-efficiency improvements through as-yet-unexplored temporal-information-processing techniques, such as simultaneous multisample injection and pitched sampling at the read-out compared to the information "write-in".
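A compact Python sketch of the reservoir-computing principle: only a linear "read-out" is trained (ridge regression here) on the nonlinear transients of a fixed random recurrent network. A generic echo-state reservoir stands in for the paper's photonic phase-delay dynamics:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 1000
W = rng.standard_normal((N, N)) / np.sqrt(N) * 0.9   # fixed recurrent weights
W_in = rng.standard_normal(N)

u = np.sin(np.linspace(0, 40, T))                    # toy input signal
target = np.roll(u, -5)                              # task: predict 5 steps ahead
X = np.zeros((T, N)); x = np.zeros(N)
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])                 # nonlinear transient
    X[t] = x

ridge = 1e-4                                         # read-out training only
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(N), X.T @ target)
print("train MSE:", np.mean((X @ W_out - target) ** 2))
```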
Large-Scale Simulations of Plastic Neural Networks on Neuromorphic Hardware
Knight, James C.; Tully, Philip J.; Kaplan, Bernhard A.; Lansner, Anders; Furber, Steve B.
2016-01-01
SpiNNaker is a digital, neuromorphic architecture designed for simulating large-scale spiking neural networks at speeds close to biological real-time. Rather than using bespoke analog or digital hardware, the basic computational unit of a SpiNNaker system is a general-purpose ARM processor, allowing it to be programmed to simulate a wide variety of neuron and synapse models. This flexibility is particularly valuable in the study of biological plasticity phenomena. A recently proposed learning rule based on the Bayesian Confidence Propagation Neural Network (BCPNN) paradigm offers a generic framework for modeling the interaction of different plasticity mechanisms using spiking neurons. However, it can be computationally expensive to simulate large networks with BCPNN learning since it requires multiple state variables for each synapse, each of which needs to be updated every simulation time-step. We discuss the trade-offs in efficiency and accuracy involved in developing an event-based BCPNN implementation for SpiNNaker based on an analytical solution to the BCPNN equations, and detail the steps taken to fit this within the limited computational and memory resources of the SpiNNaker architecture. We demonstrate this learning rule by learning temporal sequences of neural activity within a recurrent attractor network which we simulate at scales of up to 2.0 × 10^4 neurons and 5.1 × 10^7 plastic synapses: the largest plastic neural network ever to be simulated on neuromorphic hardware. We also run a comparable simulation on a Cray XC-30 supercomputer system and find that, if it is to match the run-time of our SpiNNaker simulation, the supercomputer system uses approximately 45× more power. This suggests that cheaper, more power efficient neuromorphic systems are becoming useful discovery tools in the study of plasticity in large-scale brain models. PMID:27092061
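A stripped-down Python sketch of the event-driven trick described: because the traces decay exponentially, a value can be advanced analytically from the last update time instead of being recomputed every time-step. Only a single trace is shown, and the constants are made up:

```python
import math

class ExpTrace:
    def __init__(self, tau):
        self.tau, self.value, self.t_last = tau, 0.0, 0.0

    def read(self, t):
        """Analytical decay from the last event to time t (ms)."""
        return self.value * math.exp(-(t - self.t_last) / self.tau)

    def spike(self, t, increment=1.0):
        self.value = self.read(t) + increment    # decay, then bump
        self.t_last = t                          # touched only at events

z = ExpTrace(tau=20.0)
for t in [5.0, 12.0, 40.0]:                      # presynaptic spike times
    z.spike(t)
print(z.read(60.0))                              # trace value at 60 ms
```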
Xia, Wenjun; Mita, Yoshio; Shibata, Tadashi
2016-05-01
Aiming at efficient data condensation and improved accuracy, this paper presents a hardware-friendly template reduction (TR) method for nearest neighbor (NN) classifiers by introducing the concept of critical boundary vectors. A hardware system is also implemented to demonstrate the feasibility of using a field-programmable gate array (FPGA) to accelerate the proposed method. Initially, k-means centers are used as substitutes for the entire template set. Then, to enhance the classification performance, critical boundary vectors are selected by a novel learning algorithm, which is completed within a single iteration. Moreover, to remove, in a generalized manner, noisy boundary vectors that can mislead the classification, a global categorization scheme has been explored and applied to the algorithm. The global characterization automatically categorizes each classification problem and rapidly selects the boundary vectors according to the nature of the problem. Finally, only the critical boundary vectors and the k-means centers are used as the new template set for classification. Experimental results for 24 data sets show that the proposed algorithm can effectively reduce the number of template vectors for classification with a high learning speed. At the same time, it improves the accuracy by an average of 2.17% compared with traditional NN classifiers and also shows greater accuracy than seven other TR methods. We have shown the feasibility of using a proof-of-concept FPGA system of 256 64-D vectors to accelerate the proposed method in hardware. At a 50-MHz clock frequency, the proposed system achieves a 3.86 times higher learning speed than a 3.4-GHz PC, while consuming only 1% of the power used by the PC.
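A rough Python sketch of the template-reduction idea: keep the k-means centers, then add "boundary vectors", here simplified to training samples the condensed center set alone would misclassify. The selection logic is deliberately simpler than the paper's single-iteration algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_templates(X, y, k_per_class=4):
    centers, labels = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=k_per_class, n_init=5).fit(X[y == c])
        centers.append(km.cluster_centers_)
        labels.append(np.full(k_per_class, c))
    C, cl = np.vstack(centers), np.concatenate(labels)

    # boundary vectors: samples whose nearest center has the wrong class
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    wrong = cl[d.argmin(axis=1)] != y
    return np.vstack([C, X[wrong]]), np.concatenate([cl, y[wrong]])

X = np.random.randn(300, 2); y = (X[:, 0] + X[:, 1] > 0).astype(int)
T, ty = reduce_templates(X, y)
print(len(T), "templates instead of", len(X))
```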
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU.
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces a possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using synthetic data. Experimental data were used to validate the accuracy of the proposed computing scheme and to quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation at the group-level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful as a fast screening tool for selecting the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
System-level protection and hardware Trojan detection using weighted voting.
Amin, Hany A M; Alkabani, Yousra; Selim, Gamal M I
2014-07-01
The problem of hardware Trojans is becoming more serious, especially with the widespread use of fabless design houses and design reuse. Hardware Trojans can be embedded on chip during manufacturing or in third-party intellectual property cores (IPs) during the design process. Recent research has sought to detect Trojans embedded at manufacturing time by comparing the suspected chip with a golden chip that is fully trusted. However, Trojan detection in third-party IP cores is more challenging than in other logic modules, especially because there is no golden chip. This paper proposes a new methodology to detect/prevent hardware Trojans in third-party IP cores. The method works by gradually building trust in suspected IP cores by comparing the outputs of different untrusted implementations of the same IP core. Simulation results show that our method achieves a higher probability of Trojan detection than a naive implementation of simple voting on the outputs of different IP cores. In addition, experimental results show that the proposed method requires less hardware overhead than a simple voting technique achieving the same degree of security.
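A toy sketch of the trust-building idea under stated assumptions: each redundant implementation of the IP core is modeled as a callable, the consensus is a weighted majority, and weights are reinforced or decayed by agreement with that consensus. The update constant is hypothetical; the paper's exact weighting rule is not specified here.

```python
def weighted_vote(implementations, weights, x, step=0.1):
    """Weighted-majority output of redundant untrusted IP cores for input x."""
    outputs = [impl(x) for impl in implementations]
    tally = {}
    for out, w in zip(outputs, weights):
        tally[out] = tally.get(out, 0.0) + w
    winner = max(tally, key=tally.get)
    # Gradually build trust: reward cores that agree with the consensus,
    # penalize (and eventually silence) cores that diverge, e.g. Trojans.
    for i, out in enumerate(outputs):
        weights[i] = max(weights[i] + (step if out == winner else -step), 0.0)
    return winner
```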
A Real-Time System for Lane Detection Based on FPGA and DSP
NASA Astrophysics Data System (ADS)
Xiao, Jing; Li, Shutao; Sun, Bin
2016-12-01
This paper presents a real-time lane detection system, comprising an edge-detection and improved-Hough-Transform-based lane detection algorithm and its hardware implementation on a field programmable gate array (FPGA) and a digital signal processor (DSP). Firstly, gradient amplitude and direction information are combined to extract lane edge information. Then, this information is used to determine the region of interest. Finally, the lanes are extracted using the improved Hough Transform. The image processing module of the system consists of the FPGA and the DSP. In particular, the algorithms implemented in the FPGA are pipelined and process data in parallel so that the system can run in real time. In addition, the DSP performs lane-line extraction and display using the improved Hough Transform. The experimental results show that the proposed system is able to detect lanes under different road situations efficiently and effectively.
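A compact software rendition of the same chain, using OpenCV as a stand-in for the FPGA/DSP pipeline; the gradient thresholds, direction filter, region of interest, and Hough parameters below are illustrative choices, not the authors' design values.

```python
import cv2
import numpy as np

def detect_lanes(bgr):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)          # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)          # vertical gradient
    mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)
    # Combine amplitude and direction: keep strong edges that are not
    # near-horizontal, since lane markings appear steep in the image.
    edges = ((mag > 80) & (np.abs(ang % 180 - 90) > 20)).astype(np.uint8) * 255
    edges[: edges.shape[0] // 2, :] = 0             # ROI: lower half of frame
    return cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                           minLineLength=40, maxLineGap=20)
```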
Distributed-Memory Fast Maximal Independent Set
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kanewala Appuhamilage, Thejaka Amila J.; Zalewski, Marcin J.; Lumsdaine, Andrew
The Maximal Independent Set (MIS) graph problem arises in many applications such as computer vision, information theory, molecular biology, and process scheduling. The growing scale of MIS problems suggests the use of distributed-memory hardware as a cost-effective approach to providing the necessary compute and memory resources. Luby proposed four randomized algorithms to solve the MIS problem. All of these algorithms were designed for shared-memory machines and are analyzed using the PRAM model; they do not have direct, efficient distributed-memory implementations. In this paper, we extend two of Luby’s seminal MIS algorithms, “Luby(A)” and “Luby(B),” to distributed-memory execution, and we evaluate their performance. We compare our results with the “Filtered MIS” implementation in the Combinatorial BLAS library for two types of synthetic graph inputs.
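For reference, a minimal single-process rendition of Luby's random-priority round, the logic at the heart of “Luby(A)”; the distributed-memory versions in the paper partition vertices across ranks and exchange priorities by message passing, but the per-round rule is the same. This is an illustrative sketch, not the paper's code.

```python
import random

def luby_mis(adj):
    """Maximal independent set; adj maps each vertex to its set of neighbors."""
    mis, active = set(), set(adj)
    while active:
        prio = {v: random.random() for v in active}   # fresh random priorities
        # A vertex joins the MIS if it outranks every active neighbor.
        winners = {v for v in active
                   if all(prio[v] > prio[u] for u in adj[v] if u in active)}
        mis |= winners
        active -= winners                             # winners enter the MIS
        for v in winners:
            active -= adj[v]                          # neighbors are knocked out
    return mis
```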
Design and evaluation of a DAMQ multiprocessor network with self-compacting buffers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Park, J.; O'Krafka, B.W.O.; Vassiliadis, S.
1994-12-31
This paper describes a new approach to implementing Dynamically Allocated Multi-Queue (DAMQ) switching elements using a technique called “self-compacting buffers”. This technique is efficient in that the amount of hardware required to manage the buffers is relatively small; it offers high performance since it is an implementation of a DAMQ. The first part of this paper describes the self-compacting buffer architecture in detail, and compares it against a competing DAMQ switch design. The second part presents extensive simulation results comparing the performance of a self-compacting buffer switch against an ideal switch, including several examples of k-ary n-cubes and delta networks. In addition, simulation results show how the performance of an entire network can be quickly and accurately approximated by simulating just a single switching element.
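A behavioral sketch of the self-compacting idea, under the assumption that one linear buffer holds a FIFO region per output port and that removing an entry closes the gap so free space stays contiguous at the tail; the hardware performs this shift in parallel within a cycle, which a Python list merely imitates.

```python
class SelfCompactingBuffer:
    """One shared linear buffer holding an in-order queue per output port."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []                 # [(port, flit), ...] always packed

    def enqueue(self, port, flit):
        if len(self.slots) == self.capacity:
            return False                # shared space exhausted
        self.slots.append((port, flit)) # free space is always at the tail
        return True

    def dequeue(self, port):
        for i, (p, flit) in enumerate(self.slots):
            if p == port:               # oldest flit for this port
                del self.slots[i]       # deletion compacts the buffer
                return flit
        return None
```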
High-performance, multi-faceted research sonar electronics
NASA Astrophysics Data System (ADS)
Moseley, Julian W.
This thesis describes the design, implementation and testing of a research sonar system capable of performing complex applications such as coherent Doppler measurement and synthetic aperture imaging. Specifically, this thesis presents an approach to improve the precision of the timing control and increase the signal-to-noise ratio of an existing research sonar. A dedicated timing control subsystem and hardware drivers are designed to improve the efficiency of the old sonar's timing operations. A low-noise preamplifier is designed to reduce the noise component in the received signal arriving at the input of the system's data acquisition board. Noise analysis, frequency response, and timing simulation data are generated in order to predict the functionality and performance improvements expected when the subsystems are implemented. Experimental data, gathered using these subsystems, are presented, and are shown to closely match the simulation results, thus verifying performance.
Key Gaps for Enabling Plant Growth in Future Missions
NASA Technical Reports Server (NTRS)
Anderson, Molly S.; Barta, Daniel; Douglas, Grace; Fritsche, Ralph; Massa, Gioia; Wheeler, Ray; Quincy, Charles; Romeyn, Matthew; Motil, Brian; Hanford, Anthony
2017-01-01
Growing plants to provide food or psychological benefits to crewmembers is a common vision for the future of human spaceflight, often represented both in the media and in serious concept studies. The complexity of controlled-environment agriculture and of plant growth in microgravity has been, and continues to be, the subject of dedicated scientific research. However, actually implementing these systems in a way that will be cost effective, efficient, and sustainable for future space missions is a complex, multi-disciplinary problem. Key questions exist in many areas: human research in nutrition and psychology, horticulture, plant physiology and microbiology, multi-phase microgravity fluid physics, hardware design and technology development, and system design, operations and mission planning. The criticality of the research, and the ideal solution, will vary depending on the mission and the type of system implementation being considered.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fong, K. W.
1977-08-15
This report deals with some techniques in applied programming using the Livermore Timesharing System (LTSS) on the CDC 7600 computers at the National Magnetic Fusion Energy Computer Center (NMFECC) and the Lawrence Livermore Laboratory Computer Center (LLLCC or Octopus network). This report is based on a document originally written specifically about the system as it is implemented at NMFECC but has been revised to accommodate differences between LLLCC and NMFECC implementations. Topics include: maintaining programs, debugging, recovering from system crashes, and using the central processing unit, memory, and input/output devices efficiently and economically. Routines that aid in these procedures are mentioned. The companion report, UCID-17556, An LTSS Compendium, discusses the hardware and operating system and should be read before reading this report.
Experimental model of a wind energy conversion system
NASA Astrophysics Data System (ADS)
Vasar, C.; Rat, C. L.; Prostean, O.
2018-01-01
The renewable energy domain represents an important issue for the sustainable development of mankind in the current context of increasing demand for energy along with the increasing pollution that affects the environment. A significant quota of clean energy is represented by wind energy. As a consequence, developing wind energy conversion systems (WECS) that achieve high energetic performance (efficiency, stability, availability, competitive cost, etc.) is a topic of ongoing relevance. Testing and developing an optimized control strategy for a WECS implemented directly on a real energy site is quite difficult and not cost efficient. Thus, a more convenient solution consists of a flexible laboratory setup, which requires an experimental model of a WECS. Such an approach allows the simulation of various real conditions very similar to those of existing energy sites. This paper presents a grid-connected wind turbine emulator. The wind turbine is implemented through a real-time Hardware-in-the-Loop (HIL) emulator, which is analyzed extensively in the paper. The HIL system uses software implemented in the LabVIEW programming environment to control an ABB ACS800 electric drive. The ACS800 has the task of driving an induction machine coupled to a permanent magnet synchronous generator. The power obtained from the synchronous generator is rectified, filtered and sent to the main grid through a controlled inverter. The control strategy is implemented on an NI CompactRIO (cRIO) platform.
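As a hedged illustration of what the HIL software must compute each cycle, a standard static rotor model: the emulator commands the drive to the torque a real turbine would produce at the measured wind speed and shaft speed. The power-coefficient curve and the radius/density constants below are generic textbook placeholders, not parameters of the paper's rig.

```python
import numpy as np

RHO, R = 1.225, 1.5          # air density [kg/m^3], rotor radius [m] (examples)

def cp(lam, cp_max=0.45, lam_opt=7.0, width=4.0):
    """Crude parabolic stand-in for the rotor's Cp(lambda) curve."""
    return max(0.0, cp_max * (1.0 - ((lam - lam_opt) / width) ** 2))

def turbine_torque(v_wind, omega):
    """Aerodynamic torque reference for the electric drive: T = P / omega."""
    lam = omega * R / max(v_wind, 0.1)               # tip-speed ratio
    power = 0.5 * RHO * np.pi * R**2 * cp(lam) * v_wind**3
    return power / max(omega, 0.1)                   # avoid divide-by-zero
```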
Software-Defined Architectures for Spectrally Efficient Cognitive Networking in Extreme Environments
NASA Astrophysics Data System (ADS)
Sklivanitis, Georgios
The objective of this dissertation is the design, development, and experimental evaluation of novel algorithms and reconfigurable radio architectures for spectrally efficient cognitive networking in terrestrial, airborne, and underwater environments. Next-generation wireless communication architectures and networking protocols that maximize spectrum utilization efficiency in congested/contested or low-spectral availability (extreme) communication environments can enable a rich body of applications with unprecedented societal impact. In recent years, underwater wireless networks have attracted significant attention for military and commercial applications including oceanographic data collection, disaster prevention, tactical surveillance, offshore exploration, and pollution monitoring. Unmanned aerial systems that are autonomously networked and fully mobile can assist humans in extreme or difficult-to-reach environments and provide cost-effective wireless connectivity for devices without infrastructure coverage. Cognitive radio (CR) has emerged as a promising technology to maximize spectral efficiency in dynamically changing communication environments by adaptively reconfiguring radio communication parameters. At the same time, the fast developing technology of software-defined radio (SDR) platforms has enabled hardware realization of cognitive radio algorithms for opportunistic spectrum access. However, existing algorithmic designs and protocols for shared spectrum access do not effectively capture the interdependencies between radio parameters at the physical (PHY), medium-access control (MAC), and network (NET) layers of the network protocol stack. In addition, existing off-the-shelf radio platforms and SDR programmable architectures are far from fulfilling runtime adaptation and reconfiguration across PHY, MAC, and NET layers. Spectrum allocation in cognitive networks with multi-hop communication requirements depends on the location, network traffic load, and interference profile at each network node. As a result, the development and implementation of algorithms and cross-layer reconfigurable radio platforms that can jointly treat space, time, and frequency as a unified resource to be dynamically optimized according to inter- and intra-network interference constraints is of fundamental importance. In the next chapters, we present novel algorithmic and software/hardware implementation developments toward the deployment of spectrally efficient terrestrial, airborne, and underwater wireless networks. In Chapter 1 we review the state-of-the-art in commercially available SDR platforms, describe their software and hardware capabilities, and classify them based on their ability to enable rapid prototyping and advance experimental research in wireless networks. Chapter 2 discusses system design and implementation details toward real-time evaluation of a software-radio platform for all-spectrum cognitive channelization in the presence of narrowband or wideband primary stations. All-spectrum channelization is achieved by designing maximum signal-to-interference-plus-noise ratio (SINR) waveforms that span the whole continuum of the device-accessible spectrum, while satisfying peak power and interference temperature (IT) constraints for the secondary and primary users, respectively.
In Chapter 3, we introduce the concept of all-spectrum channelization based on max-SINR optimized sparse-binary waveforms, we propose optimal and suboptimal waveform design algorithms, and evaluate their SINR and bit-error-rate (BER) performance in an SDR testbed. Chapter 4 considers the problem of channel estimation with minimal pilot signaling in multi-cell multi-user multi-input multi-output (MIMO) systems with very large antenna arrays at the base station, and proposes a least-squares (LS)-type algorithm that iteratively extracts channel and data estimates from a short record of data measurements. Our algorithmic developments toward spectrally-efficient cognitive networking through joint optimization of channel access code-waveforms and routes in a multi-hop network are described in Chapter 5. Algorithmic designs are software optimized on heterogeneous multi-core general-purpose processor (GPP)-based SDR architectures by leveraging a novel software-radio framework that offers self-optimization and real-time adaptation capabilities at the PHY, MAC, and NET layers of the network protocol stack. Our system design approach is experimentally validated under realistic conditions in a large-scale hybrid ground-air testbed deployment. Chapter 6 reviews the state-of-the-art in software and hardware platforms for underwater wireless networking and proposes a software-defined acoustic modem prototype that enables (i) cognitive reconfiguration of PHY/MAC parameters, and (ii) cross-technology communication adaptation. The proposed modem design is evaluated in terms of effective communication data rate in both water tank and lake testbed setups. In Chapter 7, we present a novel receiver configuration for code-waveform-based multiple-access underwater communications. The proposed receiver is fully reconfigurable and executes (i) all-spectrum cognitive channelization, and (ii) combined synchronization, channel estimation, and demodulation. Experimental evaluation in terms of SINR and BER shows that all-spectrum channelization is a powerful proposition for underwater communications. At the same time, the proposed receiver design can significantly enhance bandwidth utilization. Finally, in Chapter 8, we focus on challenging practical issues that arise in underwater acoustic sensor network setups where co-located multi-antenna sensor deployment is not feasible due to power, computation, and hardware limitations, and design, implement, and evaluate an underwater receiver structure that accounts for multiple carrier frequency and timing offsets in virtual (distributed) MIMO underwater systems.
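The max-SINR waveform design that recurs throughout these chapters has a well-known unconstrained core: the waveform maximizing the ratio of signal to interference-plus-noise energy is the principal generalized eigenvector of the two covariance matrices. The sketch below shows only that core step; the dissertation's peak-power and interference-temperature constraints, and the sparse-binary restriction of Chapter 3, are omitted.

```python
import numpy as np
from scipy.linalg import eigh

def max_sinr_waveform(R_s, R_in):
    """Unit-norm s maximizing (s^H R_s s) / (s^H R_in s), no constraints."""
    # eigh solves the generalized problem R_s v = w R_in v, with eigenvalues
    # in ascending order; the last eigenvector maximizes the SINR ratio.
    _, V = eigh(R_s, R_in)
    s = V[:, -1]
    return s / np.linalg.norm(s)
```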
Piromalis, Dimitrios; Arvanitis, Konstantinos
2016-08-04
Wireless Sensor and Actuator Networks (WSANs) constitute one of the most challenging technologies, with tremendous socio-economic impact, for the next decade. Functionally and energy-optimized hardware systems and development tools may be the most critical facet of this technology for the achievement of such prospects. Especially in the area of agriculture, where a hostile operating environment adds to the general technological and technical issues, reliable and robust WSAN systems are mandatory. This paper focuses on the hardware design architectures of WSANs for real-world agricultural applications. It presents the available alternatives in hardware design and identifies their difficulties and problems for real-life implementations. The paper introduces SensoTube, a new WSAN hardware architecture, which is proposed as a solution to the various existing design constraints of WSANs. The establishment of the proposed architecture is based, firstly, on an abstraction approach in the functional requirements context, and secondly, on the standardization of the subsystems' connectivity, in order to allow for an open, expandable, flexible, reconfigurable, energy-optimized, reliable and robust hardware system. The SensoTube implementation reference model, together with its encapsulation design and installation, are analyzed and presented in detail. Furthermore, as a proof of concept, certain use cases have been studied in order to demonstrate the benefits of migrating existing designs based on the available open-source hardware platforms to the SensoTube architecture.
Implementation of a system to provide mobile satellite services in North America
NASA Technical Reports Server (NTRS)
Johanson, Gary A.; Davies, N. George; Tisdale, William R. H.
1993-01-01
This paper describes the implementation of the ground network to support Mobile Satellite Services (MSS). The system is designed to take advantage of a powerful new satellite series and provides significant improvements in capacity and throughput over systems in service today. The system is described in terms of the services provided and the system architecture being implemented to deliver those services. The system operation is described including examples of a circuit switched and packet switched call placement. The physical architecture is presented showing the major hardware components and software functionality placement within the hardware.
NASA Astrophysics Data System (ADS)
Bellerby, Tim
2015-04-01
PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling as the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load imbalance at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference: Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195
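A toy rendition of the distribution rule just described, assuming the default even split: when a parallel statement creates fewer tasks than the processors owned by the current task, the processor set is sliced among the new tasks; otherwise tasks are dealt out across the processors. This illustrates the stated semantics, not PM's implementation.

```python
def distribute(processors, n_tasks):
    """Even-by-default split of a processor list among n_tasks (PM-style)."""
    if n_tasks <= len(processors):
        # Fewer tasks than processors: each task owns a slice of processors.
        q, r = divmod(len(processors), n_tasks)
        out, i = [], 0
        for t in range(n_tasks):
            width = q + (1 if t < r else 0)
            out.append(processors[i:i + width])
            i += width
        return out
    # More tasks than processors: deal tasks round-robin onto processors.
    return [[processors[t % len(processors)]] for t in range(n_tasks)]
```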
Space biology initiative program definition review. Trade study 4: Design modularity and commonality
NASA Technical Reports Server (NTRS)
Jackson, L. Neal; Crenshaw, John, Sr.; Davidson, William L.; Herbert, Frank J.; Bilodeau, James W.; Stoval, J. Michael; Sutton, Terry
1989-01-01
The relative cost impacts (up or down) of developing Space Biology hardware using design modularity and commonality are studied. Recommendations are provided for how the hardware development should be accomplished to meet optimum design modularity requirements for Life Science investigation hardware. In addition, the relative cost impacts of implementing commonality of hardware for all Space Biology hardware are defined. Cost analysis and supporting recommendations for levels of modularity and commonality are presented. A mathematical or statistical cost analysis method with the capability to support development of production design modularity and commonality impacts to parametric cost analysis is provided.
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
Kutzner, Carsten; Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L.; Grubmüller, Helmut
2015-01-01
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well‐exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)‐based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off‐loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance‐to‐price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer‐class GPUs this improvement equally reflects in the performance‐to‐price ratio. Although memory issues in consumer‐class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost‐efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well‐balanced ratio of CPU and consumer‐class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc. PMID:26238484
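The cost criterion reduces to simple arithmetic: lifetime cost is the hardware price plus energy (node power × lifetime × electricity-and-cooling rate), and the figure of merit is trajectory produced per unit cost. The numbers below are placeholders for illustration, not benchmark values from the paper.

```python
def ns_per_euro(ns_per_day, price_eur, power_w, years=4.0, eur_per_kwh=0.25):
    """Nanoseconds of trajectory produced per euro over the node's lifetime."""
    hours = years * 365 * 24
    energy_eur = power_w / 1000.0 * hours * eur_per_kwh  # power plus cooling
    return ns_per_day * years * 365 / (price_eur + energy_eur)

# Example: a hypothetical 2000-EUR node drawing 300 W at 150 ns/day.
print(ns_per_euro(150, 2000, 300))
```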
NASA Astrophysics Data System (ADS)
Vodenicarevic, D.; Locatelli, N.; Mizrahi, A.; Friedman, J. S.; Vincent, A. F.; Romera, M.; Fukushima, A.; Yakushiji, K.; Kubota, H.; Yuasa, S.; Tiwari, S.; Grollier, J.; Querlioz, D.
2017-11-01
Low-energy random number generation is critical for many emerging computing schemes proposed to complement or replace von Neumann architectures. However, current random number generators are always associated with an energy cost that is prohibitive for these computing schemes. We introduce random number bit generation based on specific nanodevices: superparamagnetic tunnel junctions. We experimentally demonstrate high-quality random bit generation that represents an orders-of-magnitude improvement in energy efficiency over current solutions. We show that the random generation speed improves with nanodevice scaling, and we investigate the impact of temperature, magnetic field, and cross talk. Finally, we show how alternative computing schemes can be implemented using superparamagnetic tunnel junctions as random number generators. These results open the way for fabricating efficient hardware computing devices leveraging stochasticity, and they highlight an alternative use for emerging nanodevices.
A software framework for pipelined arithmetic algorithms in field programmable gate arrays
NASA Astrophysics Data System (ADS)
Kim, J. B.; Won, E.
2018-03-01
Pipelined algorithms implemented in field programmable gate arrays are extensively used for hardware triggers in the modern experimental high energy physics field and the complexity of such algorithms increases rapidly. For development of such hardware triggers, algorithms are developed in C++, ported to hardware description language for synthesizing firmware, and then ported back to C++ for simulating the firmware response down to the single bit level. We present a C++ software framework which automatically simulates and generates hardware description language code for pipelined arithmetic algorithms.
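A minimal sketch of what clock-accurate simulation of such a datapath looks like: each stage is a pure function followed by a register, and values emerge after one cycle per stage, just as in the synthesized firmware. The class below illustrates the concept under assumed fixed-width arithmetic; it is not the framework's actual API.

```python
class Pipeline:
    """Bit-width-limited, clock-accurate model of a pipelined datapath."""

    def __init__(self, stages, width=16):
        self.stages = stages               # one pure function per stage
        self.regs = [0] * len(stages)      # one register behind each stage
        self.mask = (1 << width) - 1       # emulate fixed-width hardware

    def clock(self, x):
        """Advance one cycle; return the value leaving the final stage."""
        out = self.regs[-1]
        for i in reversed(range(1, len(self.stages))):
            self.regs[i] = self.stages[i](self.regs[i - 1]) & self.mask
        self.regs[0] = self.stages[0](x) & self.mask
        return out

# Example: a two-stage add-then-scale pipeline with a latency of two clocks.
pipe = Pipeline([lambda a: a + 1, lambda a: 3 * a])
```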
Eshraghian, Jason K; Baek, Seungbum; Kim, Jun-Ho; Iannella, Nicolangelo; Cho, Kyoungrok; Goo, Yong Sook; Iu, Herbert H C; Kang, Sung-Mo; Eshraghian, Kamran
2018-02-13
Existing computational models of the retina often compromise between biophysical accuracy and a hardware-adaptable implementation methodology. When compared to the current modes of vision restoration, algorithmic models often contain a greater correlation between stimuli and the affected neural network, but lack physical hardware practicality. Thus, if the present processing methods are adapted to complement very-large-scale circuit design techniques, it is anticipated that this will engender a more feasible approach to the physical construction of the artificial retina. The computational model presented in this research serves to provide a fast and accurate predictive model of the retina, a deeper understanding of neural responses to visual stimulation, and an architecture that can realistically be transformed into a hardware device. Traditionally, implicit (or semi-implicit) ordinary differential equations (ODEs) have been used for optimal speed and accuracy. We present a novel approach that requires the effective integration of different dynamical time scales within a unified framework of neural responses, where the rod, cone, amacrine, bipolar, and ganglion cells correspond to the implemented pathways. Furthermore, we show that adopting numerical integration can both accelerate retinal pathway simulations by more than 50% when compared with traditional ODE solvers in some cases, and prove to be a more realizable solution for the hardware implementation of predictive retinal models.
DOT National Transportation Integrated Search
2013-10-01
In this work, a previously developed structural health monitoring (SHM) system was advanced toward a ready-for-implementation system. Improvements were made with respect to automated data reduction/analysis, data acquisition hardware, sensor types,...
Differences Between VLBI2010 and S/X Hardware
NASA Technical Reports Server (NTRS)
Corey, Brian
2010-01-01
While the overall architecture is similar for the station hardware in current S/X systems and in the VLBI2010 systems under development, various functions are implemented differently. Some of these differences, and the reasons behind them, are described here.
Safe to Fly: Certifying COTS Hardware for Spaceflight
NASA Technical Reports Server (NTRS)
Fichuk, Jessica L.
2011-01-01
Providing hardware for the astronauts to use on board the Space Shuttle or International Space Station (ISS) involves a certification process that entails evaluating hardware safety, weighing risks, providing mitigation, and verifying requirements. Upon completion of this certification process, the hardware is deemed safe to fly. This process can be completed in as little as 1 week or can take several years, depending on the complexity of the hardware and whether the item is a unique custom design. One area of cost and schedule savings that NASA implements is buying Commercial Off the Shelf (COTS) hardware and certifying it for human spaceflight as safe to fly. By utilizing commercial hardware, NASA saves time by not having to develop, design and build the hardware from scratch, and also saves time in the certification process. With COTS hardware, the current detailed certification process can be simplified, which results in schedule savings. Cost savings is another important benefit of flying COTS hardware. Procuring COTS hardware for space use can be more economical than custom building the hardware. This paper investigates the cost savings associated with certifying COTS hardware to NASA's standards rather than performing a custom build.
An IoT-Based Solution for Monitoring a Fleet of Educational Buildings Focusing on Energy Efficiency.
Amaxilatis, Dimitrios; Akrivopoulos, Orestis; Mylonas, Georgios; Chatzigiannakis, Ioannis
2017-10-10
Raising awareness among young people and changing their behaviour and habits concerning energy usage is key to achieving sustained energy saving. Additionally, young people are very sensitive to environmental protection, so raising awareness among children is much easier than with any other group of citizens. This work examines ways to create an innovative Information & Communication Technologies (ICT) ecosystem (including web-based, mobile, social and sensing elements) tailored specifically for school environments, taking into account both the users (faculty, staff, students, parents) and school buildings, thus motivating and supporting young citizens' behavioural change to achieve greater energy efficiency. A mixture of open-source IoT hardware and proprietary platforms at the infrastructure level is currently being utilized for monitoring a fleet of 18 educational buildings across 3 countries, comprising over 700 IoT monitoring points. Presented here are the system's high-level architecture and several aspects of its implementation, related to the application domain of educational building monitoring and energy efficiency. The system is developed based on open-source technologies and services in order to make it capable of providing an open IT infrastructure and support for different commercial hardware/sensor vendors as well as open-source solutions. The system presented can be used to develop and offer new app-based solutions that can be used either for educational purposes or for managing the energy efficiency of the building. The system is replicable and adaptable to settings that may be different from the scenarios envisioned here (e.g., targeting different climate zones) and different IT infrastructures, and it can be easily extended to accommodate integration with other systems. The overall performance of the system is evaluated in a real-world environment in terms of scalability, responsiveness and simplicity. PMID:28994719
Radiology on handheld devices: image display, manipulation, and PACS integration issues.
Raman, Bhargav; Raman, Raghav; Raman, Lalithakala; Beaulieu, Christopher F
2004-01-01
Handheld personal digital assistants (PDAs) have undergone continuous and substantial improvements in hardware and graphics capabilities, making them a compelling platform for novel developments in teleradiology. The latest PDAs have processor speeds of up to 400 MHz and storage capacities of up to 80 Gbytes with memory expansion methods. A Digital Imaging and Communications in Medicine (DICOM)-compliant, vendor-independent handheld image access system was developed in which a PDA server acts as the gateway between a picture archiving and communication system (PACS) and PDAs. The system is compatible with most currently available PDA models. It is capable of both wired and wireless transfer of images and includes custom PDA software and World Wide Web interfaces that implement a variety of basic image manipulation functions. Implementation of this system, which is currently undergoing debugging and beta testing, required optimization of the user interface to efficiently display images on smaller PDA screens. The PDA server manages user work lists and implements compression and security features to accelerate transfer speeds, protect patient information, and regulate access. Although some limitations remain, PDA-based teleradiology has the potential to increase the efficiency of the radiologic work flow, increasing productivity and improving communication with referring physicians and patients. Copyright RSNA, 2004
Unified transform architecture for AVC, AVS, VC-1 and HEVC high-performance codecs
NASA Astrophysics Data System (ADS)
Dias, Tiago; Roma, Nuno; Sousa, Leonel
2014-12-01
A unified architecture for fast and efficient computation of the set of two-dimensional (2-D) transforms adopted by the most recent state-of-the-art digital video standards is presented in this paper. In contrast to other designs with similar functionality, the presented architecture is supported by a scalable, modular and completely configurable processing structure. This flexible structure not only allows the architecture to be easily reconfigured to support different transform kernels, but also permits its resizing to efficiently support transforms of different orders (e.g. order-4, order-8, order-16 and order-32). Consequently, it is not only highly suitable for realizing high-performance multi-standard transform cores, but it also offers highly efficient implementations of specialized processing structures addressing only the reduced subset of transforms used by a specific video standard. The experimental results obtained by prototyping several configurations of this processing structure in a Xilinx Virtex-7 FPGA show the superior performance and hardware efficiency levels provided by the proposed unified architecture for the implementation of transform cores for the Advanced Video Coding (AVC), Audio Video coding Standard (AVS), VC-1 and High Efficiency Video Coding (HEVC) standards. In addition, such results also demonstrate the ability of this processing structure to realize multi-standard transform cores supporting all the standards mentioned above that are capable of processing the 8k Ultra High Definition Television (UHDTV) video format (7,680 × 4,320 at 30 fps) in real time.
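All of the transforms concerned are separable, so whatever the standard, the datapath ultimately evaluates Y = C·X·Cᵀ for a configurable kernel matrix C of the selected order; that is what makes a single reconfigurable structure sufficient. A NumPy sketch of the computation follows, using the widely quoted order-4 integer kernel of the AVC core transform as the example.

```python
import numpy as np

def transform_2d(X, C):
    """Separable 2-D transform: 1-D transform on rows, then on columns."""
    return C @ X @ C.T

# Order-4 integer kernel as commonly quoted for the AVC core transform.
C4 = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

Y = transform_2d(np.arange(16).reshape(4, 4), C4)   # 4x4 block of coefficients
```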
Maximum-Likelihood Estimation With a Contracting-Grid Search Algorithm
Hesterman, Jacob Y.; Caucci, Luca; Kupinski, Matthew A.; Barrett, Harrison H.; Furenlid, Lars R.
2010-01-01
A fast search algorithm capable of operating in multi-dimensional spaces is introduced. As a sample application, we demonstrate its utility in the 2D and 3D maximum-likelihood position-estimation problem that arises in the processing of PMT signals to derive interaction locations in compact gamma cameras. We demonstrate that the algorithm can be parallelized in pipelines, and thereby efficiently implemented in specialized hardware, such as field-programmable gate arrays (FPGAs). A 2D implementation of the algorithm is achieved in Cell/BE processors, resulting in processing speeds above one million events per second, which is a 20× increase in speed over a conventional desktop machine. Graphics processing units (GPUs) are used for a 3D application of the algorithm, resulting in processing speeds of nearly 250,000 events per second which is a 250× increase in speed over a conventional desktop machine. These implementations indicate the viability of the algorithm for use in real-time imaging applications. PMID:20824155
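The contracting-grid rule is compact enough to state directly: evaluate the likelihood on a coarse grid, recenter on the best node, shrink the grid, and repeat. A serial 2-D sketch follows; the grid size, iteration count, and contraction factor are illustrative, and on an FPGA or GPU the per-grid evaluations run in parallel pipelines.

```python
import numpy as np

def contracting_grid_search(loglike, center, span, n=5, iters=8, shrink=0.5):
    """Maximize loglike((x, y)) by repeated grid evaluation and contraction."""
    center, span = np.asarray(center, float), np.asarray(span, float)
    for _ in range(iters):
        xs = np.linspace(center[0] - span[0], center[0] + span[0], n)
        ys = np.linspace(center[1] - span[1], center[1] + span[1], n)
        pts = np.array([(x, y) for x in xs for y in ys])
        vals = np.array([loglike(p) for p in pts])  # parallel in hardware
        center = pts[np.argmax(vals)]               # recenter on best node
        span *= shrink                              # contract the grid
    return center
```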
NASA Technical Reports Server (NTRS)
Metcalfe, A. G.; Bodenheimer, R. E.
1976-01-01
A parallel algorithm for counting the number of logic-1 elements in a binary array or image, developed during a preliminary investigation of the Tse concept, is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, a Tse instruction set, and a software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
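The counting scheme amounts to a balanced adder tree over the pixels of the binary image; a hedged software rendition is below, where each while-loop pass stands in for one level of parallel combinational adders in the Tse structure.

```python
def count_ones(bits):
    """Sum 0/1 pixels with a balanced adder tree (logarithmic depth)."""
    vals = list(bits)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)              # pad so every adder has two inputs
        # One tree level: independent two-input additions, parallel in hardware.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0
```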
The artificial retina processor for track reconstruction at the LHC crossing rate
Abba, A.; Bedeschi, F.; Citterio, M.; ...
2015-03-16
We present results of an R&D study for a specialized processor capable of precisely reconstructing, in pixel detectors, hundreds of charged-particle tracks from high-energy collisions at a 40 MHz rate. We apply a highly parallel pattern-recognition algorithm, inspired by studies of how the brain processes visual images, and describe in detail an efficient hardware implementation in high-speed, high-bandwidth FPGA devices. This is the first detailed demonstration of reconstruction of offline-quality tracks at 40 MHz and makes the device suitable for processing Large Hadron Collider events at the full crossing frequency.
Mitchell, Wayne; Breen, Colin; Entzeroth, Michael
2008-03-01
The Experimental Therapeutics Center (ETC) has been established at Biopolis to advance translational research by bridging the gap between discovery science and commercialization. We describe the Electronic Research Habitat at ETC, a comprehensive hardware and software infrastructure designed to effectively manage terabyte data flows and storage, increase back office efficiency, enhance the scientific work experience, and satisfy rigorous regulatory and legal requirements. Our habitat design is secure, scalable and robust, and it strives to embody the core values of the knowledge-based workplace, thus contributing to the strategic goal of building a "knowledge economy" in the context of Singapore's on-going biotechnology initiative.
Development of an Ultraflex-Based Thin Film Solar Array for Space Applications
NASA Technical Reports Server (NTRS)
White, Steve; Douglas, Mark; Spence, Brian; Jones, P. Alan; Piszczor, Michael F.
2003-01-01
As flexible thin film photovoltaic (FTFPV) cell technology is developed for space applications, integration into a viable solar array structure that optimizes the attributes of this cell technology is critical. An advanced version of ABLE's UltraFlex solar array platform represents a near-term, low-risk approach to demonstrating outstanding array performance with the implementation of FTFPV technology. Recent studies indicate that an advanced UltraFlex solar array populated with 15% efficient thin film cells can achieve over 200 W/kg EOL. An overview of the status of hardware development and the future potential of this technology is presented.
NASA Technical Reports Server (NTRS)
Tuccillo, J. J.
1984-01-01
Numerical Weather Prediction (NWP), for both operational and research purposes, requires not only fast computational speed but also large memory. A technique for solving the Primitive Equations for atmospheric motion on the CYBER 205, as implemented in the Mesoscale Atmospheric Simulation System, which is fully vectorized and requires substantially less memory than other techniques such as the leapfrog or Adams-Bashforth schemes, is discussed. The technique presented uses the Euler-backward time marching scheme. Also discussed are several techniques for reducing the computational time of the model by replacing slow intrinsic routines with faster algorithms that use only hardware vector instructions.
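For concreteness, the Euler-backward (Matsuno) step is a forward predictor followed by a corrector that re-evaluates the tendency at the predicted state, so only the current state and one predicted state need be stored, versus the two retained time levels of leapfrog. A schematic sketch, with f standing for the discretized right-hand side of the primitive equations:

```python
def euler_backward_step(u, f, dt):
    """Matsuno (Euler-backward) step: forward predictor, backward corrector."""
    u_star = u + dt * f(u)        # predictor: forward Euler guess
    return u + dt * f(u_star)     # corrector: tendency at the predicted state
```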
Flight testing the digital electronic engine control in the F-15 airplane
NASA Technical Reports Server (NTRS)
Myers, L. P.
1984-01-01
The digital electronic engine control (DEEC) is a full-authority digital engine control developed for the F100-PW-100 turbofan engine, which was flight tested on an F-15 aircraft. The DEEC hardware and software were evaluated throughout the F-15 flight envelope. Real-time data reduction and data display systems were implemented. New test techniques and stronger coordination between the propulsion test engineer and pilot were developed, which produced efficient use of test time, reduced pilot workload, and greatly improved data quality. The engine pressure ratio (EPR) control mode is demonstrated. It is found that the nonaugmented throttle transients and engine performance are satisfactory.
Complexity of the Quantum Adiabatic Algorithm
NASA Astrophysics Data System (ADS)
Hen, Itay
2013-03-01
The Quantum Adiabatic Algorithm (QAA) has been proposed as a mechanism for efficiently solving optimization problems on a quantum computer. Since adiabatic computation is analog in nature and does not require the design and use of quantum gates, it can be thought of as a simpler and perhaps more profound method for performing quantum computations that might also be easier to implement experimentally. While these features have generated substantial research in QAA, to date there is still a lack of solid evidence that the algorithm can outperform classical optimization algorithms. Here, we discuss several aspects of the quantum adiabatic algorithm: We analyze the efficiency of the algorithm on several “hard” (NP) computational problems. Studying the size dependence of the typical minimum energy gap of the Hamiltonians of these problems using quantum Monte Carlo methods, we find that while for most problems the minimum gap decreases exponentially with the size of the problem, indicating that the QAA is not more efficient than existing classical search algorithms, for other problems there is evidence to suggest that the gap may be polynomial near the phase transition. We also discuss applications of the QAA to “real life” problems and how they can be implemented on currently available (albeit prototypical) quantum hardware such as “D-Wave One”, which imposes serious restrictions as to which types of problems may be tested. Finally, we discuss different approaches to finding improved implementations of the algorithm, such as local adiabatic evolution, adaptive methods, local search in Hamiltonian space, and others.
FUX-Sim: Implementation of a fast universal simulation/reconstruction framework for X-ray systems.
Abella, Monica; Serrano, Estefania; Garcia-Blas, Javier; García, Ines; de Molina, Claudia; Carretero, Jesus; Desco, Manuel
2017-01-01
The availability of digital X-ray detectors, together with advances in reconstruction algorithms, creates an opportunity for bringing 3D capabilities to conventional radiology systems. The downside is that reconstruction algorithms for non-standard acquisition protocols are generally based on iterative approaches that involve a high computational burden. The development of new flexible X-ray systems could benefit from computer simulations, which may enable performance to be checked before expensive real systems are implemented. The development of simulation/reconstruction algorithms in this context poses three main difficulties. First, the algorithms deal with large data volumes and are computationally expensive, thus leading to the need for hardware and software optimizations. Second, these optimizations are limited by the high flexibility required to explore new scanning geometries, including fully configurable positioning of source and detector elements. And third, the evolution of the various hardware setups increases the effort required for maintaining and adapting the implementations to current and future programming models. Previous works lack support for completely flexible geometries and/or compatibility with multiple programming models and platforms. In this paper, we present FUX-Sim, a novel X-ray simulation/reconstruction framework that was designed to be flexible and fast. Optimized implementation for different families of GPUs (CUDA and OpenCL) and multi-core CPUs was achieved thanks to a modularized approach based on a layered architecture and parallel implementation of the algorithms for both architectures. A detailed performance evaluation demonstrates that for different system configurations and hardware platforms, FUX-Sim maximizes performance with the CUDA programming model (5 times faster than other state-of-the-art implementations). Furthermore, the CPU and OpenCL programming models allow FUX-Sim to be executed over a wide range of hardware platforms.
Kaliman, Ilya A; Krylov, Anna I
2017-04-30
A new hardware-agnostic contraction algorithm for tensors of arbitrary symmetry and sparsity is presented. The algorithm is implemented as a stand-alone open-source code, libxm. This code is also integrated with the general tensor library libtensor and with the Q-Chem quantum-chemistry package. An overview of the algorithm, its implementation, and benchmarks are presented. Similarly to other tensor software, the algorithm exploits efficient matrix multiplication libraries and assumes that tensors are stored in a block-tensor form. The distinguishing features of the algorithm are: (i) efficient repackaging of the individual blocks into large matrices and back, which affords efficient graphics processing unit (GPU)-enabled calculations without modifications of higher-level codes; (ii) fully asynchronous data transfer between disk storage and fast memory. The algorithm enables canonical all-electron coupled-cluster and equation-of-motion coupled-cluster calculations with single and double substitutions (CCSD and EOM-CCSD) with over 1000 basis functions on a single quad-GPU machine. We show that the algorithm exhibits the predicted theoretical scaling for canonical CCSD calculations, O(N^6), irrespective of the data size on disk. © 2017 Wiley Periodicals, Inc.
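Feature (i), repackaging blocks into large matrices so each contraction becomes a single GEMM, can be sketched in a few lines for the dense, unsymmetric case; real libxm adds the symmetry and sparsity handling plus the asynchronous disk transfers of feature (ii), none of which is shown here.

```python
import numpy as np

def contract(T1, T2, k):
    """Contract the last k axes of T1 with the first k axes of T2 as one GEMM."""
    m = int(np.prod(T1.shape[:-k]))   # rows: free indices of T1
    n = int(np.prod(T2.shape[k:]))    # columns: free indices of T2
    s = int(np.prod(T1.shape[-k:]))   # summed (contracted) dimension
    # "Repackage" the tensors into matrices, multiply, unpack the result.
    C = T1.reshape(m, s) @ T2.reshape(s, n)
    return C.reshape(T1.shape[:-k] + T2.shape[k:])
```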
Ultralow power artificial synapses using nanotextured magnetic Josephson junctions
Schneider, Michael L.; Donnelly, Christine A.; Russek, Stephen E.; Baek, Burm; Pufall, Matthew R.; Hopkins, Peter F.; Dresselhaus, Paul D.; Benz, Samuel P.; Rippard, William H.
2018-01-01
Neuromorphic computing promises to markedly improve the efficiency of certain computational tasks, such as perception and decision-making. Although software and specialized hardware implementations of neural networks have made tremendous accomplishments, both implementations are still many orders of magnitude less energy efficient than the human brain. We demonstrate a new form of artificial synapse based on dynamically reconfigurable superconducting Josephson junctions with magnetic nanoclusters in the barrier. The spiking energy per pulse varies with the magnetic configuration, but in our demonstration devices, the spiking energy is always less than 1 aJ. This compares very favorably with the roughly 10 fJ per synaptic event in the human brain. Each artificial synapse is composed of a Si barrier containing Mn nanoclusters with superconducting Nb electrodes. The critical current of each synapse junction, which is analogous to the synaptic weight, can be tuned using input voltage spikes that change the spin alignment of Mn nanoclusters. We demonstrate synaptic weight training with electrical pulses as small as 3 aJ. Further, the Josephson plasma frequencies of the devices, which determine the dynamical time scales, all exceed 100 GHz. These new artificial synapses provide a significant step toward a neuromorphic platform that is faster, more energy-efficient, and thus can attain far greater complexity than has been demonstrated with other technologies. PMID:29387787
DOE Office of Scientific and Technical Information (OSTI.GOV)
Levine, Benjamin G., E-mail: ben.levine@temple.ed; Stone, John E., E-mail: johns@ks.uiuc.ed; Kohlmeyer, Axel, E-mail: akohlmey@temple.ed
2011-05-01
The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.
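Per frame, the rate-limiting step is just a histogram of pairwise distances, as in the NumPy sketch below; the GPU versions obtain their speedup by tiling coordinates through shared memory and merging per-thread histograms with atomic adds rather than materializing the full distance matrix. The tile size here is an illustrative placeholder.

```python
import numpy as np

def rdf_histogram(A, B, r_max, n_bins, tile=4096):
    """Histogram of pair distances between selections A and B for one frame."""
    hist = np.zeros(n_bins, dtype=np.int64)
    for i in range(0, len(A), tile):                 # tile to bound memory,
        d = np.linalg.norm(A[i:i + tile, None, :]    # loosely mirroring the
                           - B[None, :, :], axis=2)  # GPU tiling scheme
        h, _ = np.histogram(d, bins=n_bins, range=(0.0, r_max))
        hist += h
    return hist
```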
Stone, John E.; Kohlmeyer, Axel
2011-01-01
The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU’s memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 seconds per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis. PMID:21547007
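The core of the computation above is simple enough to sketch in a few lines. The following is a minimal NumPy sketch of the tiled pair-distance histogramming step (normalization of the histogram into an RDF is omitted); the function name, array shapes, and tile size are illustrative assumptions, not VMD's actual API.

```python
import numpy as np

def rdf_histogram(sel_a, sel_b, r_max, n_bins, tile=4096):
    """Histogram pairwise distances between two atom selections for one
    frame. Tiling over sel_b bounds the working set, loosely mirroring
    the shared-memory tiling used in the GPU kernels described above.
    sel_a: (N, 3) and sel_b: (M, 3) coordinate arrays (hypothetical shapes).
    """
    edges = np.linspace(0.0, r_max, n_bins + 1)
    hist = np.zeros(n_bins, dtype=np.int64)
    for start in range(0, len(sel_b), tile):
        tile_b = sel_b[start:start + tile]
        # (N, tile) block of pair distances for this tile only
        d = np.linalg.norm(sel_a[:, None, :] - tile_b[None, :, :], axis=2)
        counts, _ = np.histogram(d, bins=edges)
        hist += counts
    return hist, edges
```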
Fundamentals of Hardware. Curriculum Improvement Project. Region II.
ERIC Educational Resources Information Center
Onabajo, Femi
This course curriculum is intended for use by community college instructors and administrators in implementing a fundamentals in hardware course. A student's course syllabus provides this information: credit hours, catalog description, prerequisites, required text, instructional process, objectives, student evaluation, and class schedule. A…
FPGA Coprocessor for Accelerated Classification of Images
NASA Technical Reports Server (NTRS)
Pingree, Paula J.; Scharenbroich, Lucas J.; Werne, Thomas A.
2008-01-01
An effort related to that described in the preceding article focuses on developing a spaceborne processing platform for fast and accurate onboard classification of image data, a critical part of modern satellite image processing. The approach again has been to exploit the versatility of recently developed hybrid Virtex-4FX field-programmable gate arrays (FPGAs) to run diverse science applications on embedded processors while taking advantage of the reconfigurable hardware resources of the FPGAs. In this case, the FPGA serves as a coprocessor that implements legacy C-language support-vector-machine (SVM) image-classification algorithms to detect and identify natural phenomena such as flooding, volcanic eruptions, and sea-ice break-up. The FPGA provides hardware acceleration, yielding greater onboard processing capability than previously demonstrated in software. The original C-language program, demonstrated on an imaging instrument aboard the Earth Observing-1 (EO-1) satellite, implements a linear-kernel SVM algorithm for classifying parts of the images as snow, water, ice, land, cloud, or unclassified. Current onboard processors, such as those on EO-1, have limited computing power and extremely limited active storage capability, and are no longer considered state-of-the-art. Using commercially available software that translates C-language programs into hardware description language (HDL) files, the legacy C-language program, and two newly formulated programs for a more capable expanded-linear-kernel and a more accurate polynomial-kernel SVM algorithm, have been implemented in the Virtex-4FX FPGA. In tests, the FPGA implementations have exhibited significant speedups over conventional software implementations running on general-purpose hardware.
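For orientation, the linear-kernel decision function that makes this algorithm attractive for hardware is a single dot product per pixel. A minimal NumPy sketch follows, with hypothetical names and array shapes; a one-vs-rest scheme over the surface classes listed above would apply this once per class.

```python
import numpy as np

def svm_classify(pixels, support_vectors, alphas, labels, bias):
    """Linear-kernel SVM decision for a batch of pixels (sketch).
    pixels: (P, B) array of B-band spectra; support_vectors: (S, B),
    alphas: (S,) dual coefficients, labels: (S,) in {-1, +1}, and bias
    come from training. With a linear kernel the dual expansion
    collapses into one weight vector, so classification is a single
    dot product per pixel.
    """
    w = (alphas * labels) @ support_vectors   # (B,) effective weight vector
    scores = pixels @ w + bias                # (P,) decision values
    return np.where(scores >= 0.0, 1, -1)
```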
Known-plaintext attack on the double phase encoding and its implementation with parallel hardware
NASA Astrophysics Data System (ADS)
Wei, Hengzheng; Peng, Xiang; Liu, Haitao; Feng, Songlin; Gao, Bruce Z.
2008-03-01
A known-plaintext attack on the double phase encryption scheme, implemented with parallel hardware, is presented. Double random phase encoding (DRPE) is one of the most representative optical cryptosystems; it was developed in the mid-1990s and has spawned quite a few variants since then. Although the DRPE encryption system strongly resists brute-force attack, its inherent linearity leaves a hidden weakness. Recently, the real security strength of this opto-cryptosystem has been questioned and analyzed from the cryptanalysis point of view. In this presentation, we demonstrate that optical cryptosystems based on the DRPE architecture are vulnerable to known-plaintext attack. With this attack, the two encryption keys in the DRPE can be recovered with the help of phase retrieval techniques. In our approach, we adopt the hybrid input-output algorithm (HIO) to recover the random phase key in the object domain and then infer the key in the frequency domain. A single plaintext-ciphertext pair is sufficient to expose the vulnerability, and the attack does not require any particular choice of plaintext. The HIO-based phase retrieval technique is an iterative process of repeated Fourier transforms, so it maps naturally onto a digital signal processor (DSP) hardware implementation. We use a high-performance DSP to carry out the known-plaintext attack; compared with a software implementation, the hardware implementation is much faster. The performance of this DSP-based cryptanalysis system is also evaluated.
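The phase-retrieval engine named here, Fienup's hybrid input-output algorithm, is compact enough to sketch. The version below is the generic Fourier-magnitude form in NumPy, with all names and constraint choices as illustrative assumptions; the attack itself applies this iteration within the DRPE architecture to recover the object-domain phase key.

```python
import numpy as np

def hio(fourier_mag, support, n_iter=500, beta=0.9):
    """Hybrid input-output phase retrieval (generic sketch).
    fourier_mag: measured Fourier magnitudes (for the attack, derived
    from a plaintext-ciphertext pair); support: boolean object-domain mask.
    """
    rng = np.random.default_rng(0)
    g = rng.random(fourier_mag.shape)               # random initial estimate
    for _ in range(n_iter):
        G = np.fft.fft2(g)
        G = fourier_mag * np.exp(1j * np.angle(G))  # impose magnitudes, keep phase
        g_new = np.fft.ifft2(G).real
        ok = support & (g_new >= 0)                 # pixels satisfying constraints
        g = np.where(ok, g_new, g - beta * g_new)   # HIO feedback elsewhere
    return g
```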
High Frequency SSVEP-BCI With Hardware Stimuli Control and Phase-Synchronized Comb Filter.
Chabuda, Anna; Durka, Piotr; Zygierewicz, Jaroslaw
2018-02-01
We present an efficient implementation of a brain-computer interface (BCI) based on high-frequency steady-state visually evoked potentials (SSVEP). The individual shape of the SSVEP response is extracted by means of a feedforward comb filter, which adds delayed versions of the signal to itself. Rendering of the stimuli is controlled by specialized hardware (BCI Appliance). Out of 15 participants in the study, nine were able to produce a stable response in at least eight out of ten frequencies from the 30-39 Hz range. They achieved on average 96±4% accuracy and 47±5 bit/min information transfer rate (ITR) for an optimized simple seven-letter speller, while a generic full-alphabet speller allowed this group 89±9% accuracy and 36±9 bit/min ITR. These values exceed the performance of high-frequency SSVEP-BCI systems reported to date. The classical approach to SSVEP parameterization, relative spectral power at the stimulation frequencies, implemented on the same data, resulted in significantly lower performance. This suggests that the specific shape of the response is an important feature in classification. Finally, we discuss the differences in SSVEP responses between the participants who were able or unable to use the interface, as well as the statistically significant influence of the speller layout on the speed of BCI operation.
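A feedforward comb filter of the kind described is a one-liner per tap: the signal is summed with copies of itself delayed by whole stimulation periods, so the phase-locked SSVEP adds coherently while background EEG tends to average out. A minimal sketch follows; the function name and tap count are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def comb_filter(x, fs, f_stim, n_taps=4):
    """Feedforward comb filter: sum the EEG signal with copies of itself
    delayed by whole stimulation periods. The phase-locked SSVEP adds
    coherently across taps; non-phase-locked background tends to cancel.
    fs: sampling rate (Hz); f_stim: stimulation frequency (Hz)."""
    period = int(round(fs / f_stim))     # one stimulation period, in samples
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k in range(n_taps):
        d = k * period
        y[d:] += x[:x.size - d] if d else x
    return y / n_taps
```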
Power and Performance Trade-offs for Space Time Adaptive Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino
Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, the Intel Haswell Core i7-4770TE and the NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP's computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementation on the Haswell CPU architecture. The GPU architecture is able to process large data sets without an increase in power requirement. The use of shared memory has a significant impact on the power requirement of the GPU. A balance between the use of shared memory and main memory access leads to improved performance in a typical STAP application.
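As context for which kernels dominate, the classical computational core of STAP is estimating a space-time covariance matrix from training snapshots and solving for adaptive weights. A minimal NumPy sketch under standard textbook assumptions follows; the paper's exact kernel decomposition may differ, and the diagonal loading factor is an illustrative choice.

```python
import numpy as np

def stap_weights(snapshots, steering):
    """Adaptive weight computation at the heart of STAP (sketch).
    snapshots: (NM, K) complex space-time training snapshots;
    steering: (NM,) space-time steering vector. Solves R w = v
    rather than forming an explicit inverse.
    """
    K = snapshots.shape[1]
    R = snapshots @ snapshots.conj().T / K    # sample covariance estimate
    # small diagonal loading for numerical robustness (illustrative value)
    R += 1e-3 * np.trace(R).real / R.shape[0] * np.eye(R.shape[0])
    w = np.linalg.solve(R, steering)
    return w / (steering.conj() @ w)          # unit gain in the steering direction
```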
Tang, Wenming; Liu, Guixiong; Li, Yuzhong; Tan, Daji
2017-01-01
High data-transmission efficiency is a key requirement for an ultrasonic phased array with multiple groups of ultrasonic sensors. Here, a novel FIFO scheduling algorithm is proposed that improves data-transmission efficiency in hardware. The algorithm uses FIFOs as caches for the ultrasonic scanning data obtained from the sensors and outputs the data in a bandwidth-sharing manner; on this basis, an optimal length ratio among the FIFOs is derived, allowing read operations to switch among the FIFOs without waiting for time slots. The algorithm therefore makes better use of the read-bandwidth resources and achieves higher efficiency than traditional scheduling algorithms. The reliability and validity of the algorithm were substantiated by implementing it in field programmable gate array (FPGA) technology, and the bandwidth utilization ratio and real-time performance of the ultrasonic phased array were both enhanced. PMID:29035345
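The abstract does not give the ratio formula, but the stated intuition, sizing each FIFO so a bandwidth-sharing reader can cycle through all of them without idle slots, suggests depths proportional to fill rates. A speculative sketch of that sizing rule follows; the function name and the proportionality assumption are mine, not the paper's.

```python
def fifo_depths(fill_rates, total_depth):
    """Split a fixed buffer budget across FIFOs in proportion to their
    fill rates (assumed interpretation of the 'optimal length ratio'):
    each FIFO then accumulates data at the rate its share implies, so a
    round-robin reader need not stall on any one of them.
    """
    total_rate = sum(fill_rates)
    return [max(1, round(total_depth * r / total_rate)) for r in fill_rates]
```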
Fly-By-Light/Power-By-Wire Fault-Tolerant Fiber-Optic Backplane
NASA Technical Reports Server (NTRS)
Malekpour, Mahyar R.
2002-01-01
The design and development of a fault-tolerant fiber-optic backplane, undertaken to demonstrate the feasibility of such an architecture, is presented. Simulation results for test cases run on the backplane in the presence of induced faults are presented, and the fault-recovery capability of the architecture is demonstrated. The architecture was designed, developed, and implemented using the Very High Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL). The architecture was synthesized and implemented in hardware using Field Programmable Gate Arrays (FPGAs) on multiple prototype boards.
Hardware description languages
NASA Technical Reports Server (NTRS)
Tucker, Jerry H.
1994-01-01
Hardware description languages are special purpose programming languages. They are primarily used to specify the behavior of digital systems and are rapidly replacing traditional digital system design techniques. This is because they allow the designer to concentrate on how the system should operate rather than on implementation details. Hardware description languages allow a digital system to be described with a wide range of abstraction, and they support top down design techniques. A key feature of any hardware description language environment is its ability to simulate the modeled system. The two most important hardware description languages are Verilog and VHDL. Verilog has been the dominant language for the design of application specific integrated circuits (ASICs). However, VHDL is rapidly gaining in popularity.
FPS-RAM: Fast Prefix Search RAM-Based Hardware for Forwarding Engine
NASA Astrophysics Data System (ADS)
Zaitsu, Kazuya; Yamamoto, Koji; Kuroda, Yasuto; Inoue, Kazunari; Ata, Shingo; Oka, Ikuo
Ternary content addressable memory (TCAM) is becoming very popular for designing high-throughput forwarding engines on routers. However, TCAM has potential problems in terms of hardware and power costs, which limit the amount of capacity that can be deployed in IP routers. In this paper, we propose a new hardware architecture for fast forwarding engines, called fast prefix search RAM-based hardware (FPS-RAM). We designed the FPS-RAM hardware with the intent of maintaining the same search performance and physical user interface as TCAM, because our objective is to replace the TCAM in the market. Our RAM-based hardware architecture is completely different from that of TCAM and dramatically reduces cost and power consumption to 62% and 52%, respectively. We implemented FPS-RAM on an FPGA to examine its lookup operation.
Hardware Implementation of Maximum Power Point Tracking for Thermoelectric Generators
NASA Astrophysics Data System (ADS)
Maganga, Othman; Phillip, Navneesh; Burnham, Keith J.; Montecucco, Andrea; Siviter, Jonathan; Knox, Andrew; Simpson, Kevin
2014-06-01
This work describes the practical implementation of two maximum power point tracking (MPPT) algorithms, namely those of perturb and observe, and extremum seeking control. The proprietary dSPACE system is used to perform hardware in the loop (HIL) simulation whereby the two control algorithms are implemented using the MATLAB/Simulink (Mathworks, Natick, MA) software environment in order to control a synchronous buck-boost converter connected to two commercial thermoelectric modules. The process of performing HIL simulation using dSPACE is discussed, and a comparison between experimental and simulated results is highlighted. The experimental results demonstrate the validity of the two MPPT algorithms, and in conclusion the benefits and limitations of real-time implementation of MPPT controllers using dSPACE are discussed.
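Of the two algorithms compared, perturb and observe is the easier to summarize: step the operating point, observe the power, and keep stepping in whichever direction power increased. A minimal sketch follows, with read_power and set_voltage as hypothetical hardware hooks standing in for the dSPACE/Simulink interface to the buck-boost converter.

```python
def perturb_and_observe(read_power, set_voltage, v0, dv=0.05, steps=1000):
    """Classic perturb-and-observe MPPT loop (sketch).
    read_power/set_voltage: placeholder callbacks into the converter;
    v0: starting voltage; dv: perturbation step size.
    """
    v, p_prev = v0, read_power()
    direction = 1.0
    for _ in range(steps):
        v += direction * dv
        set_voltage(v)
        p = read_power()
        if p < p_prev:            # power dropped: reverse the perturbation
            direction = -direction
        p_prev = p
    return v                      # voltage near the maximum power point
```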
Piromalis, Dimitrios; Arvanitis, Konstantinos
2016-01-01
Wireless Sensor and Actuator Networks (WSANs) constitute one of the most challenging technologies, with tremendous socio-economic impact, for the next decade. Functionally and energy-optimized hardware systems and development tools may be the most critical facet of this technology for the achievement of such prospects. Especially in agriculture, where a hostile operating environment adds to the general technological and technical issues, reliable and robust WSAN systems are mandatory. This paper focuses on the hardware design architectures of WSANs for real-world agricultural applications. It presents the available alternatives in hardware design and identifies their difficulties and problems for real-life implementations. The paper introduces SensoTube, a new WSAN hardware architecture, which is proposed as a solution to the various existing design constraints of WSANs. The proposed architecture is established, firstly, on an abstraction approach in the functional-requirements context and, secondly, on the standardization of subsystem connectivity, in order to allow for an open, expandable, flexible, reconfigurable, energy-optimized, reliable and robust hardware system. The SensoTube implementation reference model, together with its encapsulation design and installation, is analyzed and presented in detail. Furthermore, as a proof of concept, certain use cases have been studied in order to demonstrate the benefits of migrating existing designs based on available open-source hardware platforms to the SensoTube architecture. PMID:27527180
Microsupercomputers: Design and Implementation
1991-03-01
been ported to the DASH hardware. Hardware problems and software problems with DPV itself prevented its use as a debugging tool until recently. Both the ... (MP3D) [21], an LU-decomposition program (LU), and a digital logic simulation program (PTHOR) [28]. The applications are typical of those
NASA Technical Reports Server (NTRS)
1975-01-01
The extent to which experiment hardware and operational requirements can be met by automatic control and material-handling devices was investigated, and payload and system concepts that make extensive use of automation technology are defined. Hardware requirements for each experiment were established and tabulated, and investigations of applicable existing hardware were documented. The capabilities and characteristics of industrial automation equipment, controls, and techniques are presented as a summary of applicable equipment characteristics in three basic, mutually supporting formats. Facilities for performing groups of experiments are defined, along with four levitation groups and three furnace groups; the major hardware elements required to implement them were identified. A conceptual design definition of ten different automated processing facilities is presented, along with the specific equipment to implement each facility and the design layouts of the different units. Constraints and packaging, weight, and power requirements for six payloads postulated for shuttle missions in the 1979 to 1982 time period were examined.
Establishing a Novel Modeling Tool: A Python-Based Interface for a Neuromorphic Hardware System
Brüderle, Daniel; Müller, Eric; Davison, Andrew; Muller, Eilif; Schemmel, Johannes; Meier, Karlheinz
2008-01-01
Neuromorphic hardware systems provide new possibilities for the neuroscience modeling community. Due to the intrinsic parallelism of the micro-electronic emulation of neural computation, such models are highly scalable without a loss of speed. However, the communities of software simulator users and neuromorphic engineering in neuroscience are rather disjoint. We present a software concept that provides the possibility to establish such hardware devices as valuable modeling tools. It is based on the integration of the hardware interface into a simulator-independent language which allows for unified experiment descriptions that can be run on various simulation platforms without modification, implying experiment portability and a huge simplification of the quantitative comparison of hardware and simulator results. We introduce an accelerated neuromorphic hardware device and describe the implementation of the proposed concept for this system. An example setup and results acquired by utilizing both the hardware system and a software simulator are demonstrated. PMID:19562085
Establishing a novel modeling tool: a python-based interface for a neuromorphic hardware system.
Brüderle, Daniel; Müller, Eric; Davison, Andrew; Muller, Eilif; Schemmel, Johannes; Meier, Karlheinz
2009-01-01
Neuromorphic hardware systems provide new possibilities for the neuroscience modeling community. Due to the intrinsic parallelism of the micro-electronic emulation of neural computation, such models are highly scalable without a loss of speed. However, the communities of software simulator users and neuromorphic engineering in neuroscience are rather disjoint. We present a software concept that provides the possibility to establish such hardware devices as valuable modeling tools. It is based on the integration of the hardware interface into a simulator-independent language which allows for unified experiment descriptions that can be run on various simulation platforms without modification, implying experiment portability and a huge simplification of the quantitative comparison of hardware and simulator results. We introduce an accelerated neuromorphic hardware device and describe the implementation of the proposed concept for this system. An example setup and results acquired by utilizing both the hardware system and a software simulator are demonstrated.
Zgaren, Mohamed; Moradi, Arash; Sawan, Mohamad
2015-01-01
Energy efficiency and high data rates are desired in biomedical device transceivers. A high-performance transmitter (Tx) and an ultra-low-power receiver (Rx) dedicated to medical implant communications, operating in the Industrial, Scientific and Medical (ISM) frequency band, are presented. The Tx benefits from a new, efficient Frequency-Shift Keying (FSK) modulation technique which provides up to 20 Mb/s of data rate and consumes only 0.084 nJ/b, validated through fabrication. The receiver consists of an FSK-to-ASK conversion-based receiver with an OOK fully passive wake-up device (WuRx). The WuRx is batteryless, relying on an energy-harvesting technique that plays an important role in making the RF transceiver energy efficient. The Rx is achieved with a reduced hardware architecture which does not use an accurate local oscillator, a high-Q external inductor, or an I/Q signal path. The Rx shows -78 dBm sensitivity for an 8 Mbps data rate while consuming 639 μW. The proposed circuits are implemented in IBM 0.13 μm CMOS technology with a 1.2 V supply voltage.
Current And Future Directions Of Lens Design Software
NASA Astrophysics Data System (ADS)
Gustafson, Darryl E.
1983-10-01
The most effective environment for doing lens design continues to evolve as new computer hardware and software tools become available. Important recent hardware developments include: Low-cost but powerful interactive multi-user 32 bit computers with virtual memory that are totally software-compatible with prior larger and more expensive members of the family. A rapidly growing variety of graphics devices for both hard-copy and screen graphics, including many with color capability. In addition, with optical design software readily accessible in many forms, optical design has become a part-time activity for a large number of engineers instead of being restricted to a small number of full-time specialists. A designer interface that is friendly for the part-time user while remaining efficient for the full-time designer is thus becoming more important as well as more practical. Along with these developments, software tools in other scientific and engineering disciplines are proliferating. Thus, the optical designer is less and less unique in his use of computer-aided techniques and faces the challenge and opportunity of efficiently communicating his designs to other computer-aided-design (CAD), computer-aided-manufacturing (CAM), structural, thermal, and mechanical software tools. This paper will address the impact of these developments on the current and future directions of the CODE V™ optical design software package, its implementation, and the resulting lens design environment.
What Scientific Applications can Benefit from Hardware Transactional Memory?
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schindewolf, M; Bihari, B; Gyllenhaal, J
2012-06-04
Achieving efficient and correct synchronization of multiple threads is a difficult and error-prone task at small scale and, as we march towards extreme-scale computing, will be even more challenging when the resulting application is supposed to utilize millions of cores efficiently. Transactional Memory (TM) is a promising technique to ease the burden on the programmer, but it has only recently become available on commercial hardware in the new Blue Gene/Q system, and hence the real benefit for realistic applications has not yet been studied. This paper presents the first performance results of TM embedded into OpenMP on a prototype system of BG/Q and characterizes code properties that will likely lead to benefits when augmented with TM primitives. We first study the influence of thread count, environment variables and memory layout on TM performance and identify code properties that will yield performance gains with TM. Second, we evaluate the combination of OpenMP with multiple synchronization primitives on top of MPI to determine suitable task-to-thread ratios per node. Finally, we condense our findings into a set of best practices. These are applied to a Monte Carlo Benchmark and a Smoothed Particle Hydrodynamics method. In both cases an optimized TM version, executed with 64 threads on one node, outperforms a simple TM implementation. MCB with optimized TM yields a speedup of 27.45 over baseline.
Mahdiani, Hamid Reza; Fakhraie, Sied Mehdi; Lucas, Caro
2012-08-01
Reliability should be identified as the most important challenge in future nano-scale very large scale integration (VLSI) implementation technologies for the development of complex integrated systems. Normally, fault tolerance (FT) in a conventional system is achieved by increasing its redundancy, which implies higher implementation costs and lower performance, sometimes even making it infeasible. In contrast to such custom approaches, this paper identifies a class of applications that is inherently capable of absorbing some degree of vulnerability and providing FT based on its natural properties. Neural networks are good representatives of imprecision-tolerant applications. We also propose a new class of FT techniques, called relaxed fault-tolerant (RFT) techniques, developed for VLSI implementation of imprecision-tolerant applications. The main advantage of RFT techniques with respect to traditional FT solutions is that they exploit the inherent FT of different applications to reduce implementation costs while improving performance. To show the applicability and efficiency of the RFT method, experimental results for the implementation of a computationally intensive face-recognition neural network and its corresponding RFT realization are presented. The results demonstrate promising higher performance of artificial neural network VLSI solutions for complex applications in faulty nano-scale implementation environments.
Implementation of Multispectral Image Classification on a Remote Adaptive Computer
NASA Technical Reports Server (NTRS)
Figueiredo, Marco A.; Gloster, Clay S.; Stephens, Mark; Graves, Corey A.; Nakkar, Mouna
1999-01-01
As the demand for higher-performance computers for the processing of remote sensing science algorithms increases, the need to investigate new computing paradigms is justified. Field Programmable Gate Arrays enable the implementation of algorithms at the hardware gate level, leading to orders-of-magnitude performance increases over microprocessor-based systems. The automatic classification of spaceborne multispectral images is an example of a computation-intensive application that can benefit from implementation on an FPGA-based custom computing machine (adaptive or reconfigurable computer). A probabilistic neural network is used here to classify pixels of a multispectral LANDSAT-2 image. The implementation described utilizes Java client/server application programs to access the adaptive computer from a remote site. Results verify that a remote hardware version of the algorithm (implemented on an adaptive computer) is significantly faster than a local software version of the same algorithm implemented on a typical general-purpose computer.
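A probabilistic neural network of this kind is, at its core, a Parzen-window density estimate per class followed by an argmax, which is why it maps so naturally onto parallel gate-level hardware. A minimal NumPy sketch follows; the function name, Gaussian kernel width, and array shapes are illustrative assumptions.

```python
import numpy as np

def pnn_classify(x, train_x, train_y, sigma=1.0):
    """Probabilistic neural network classification of one pixel (sketch).
    x: (B,) pixel spectrum; train_x: (N, B) training spectra;
    train_y: (N,) class ids. Each class score is the mean of Gaussian
    kernels centered on that class's training samples; the pixel is
    assigned to the class with the largest estimated density.
    """
    d2 = np.sum((train_x - x) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian (Parzen) kernels
    classes = np.unique(train_y)
    scores = [k[train_y == c].mean() for c in classes]
    return classes[int(np.argmax(scores))]
```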
Motion compensation in digital subtraction angiography using graphics hardware.
Deuerling-Zheng, Yu; Lell, Michael; Galant, Adam; Hornegger, Joachim
2006-07-01
An inherent disadvantage of digital subtraction angiography (DSA) is its sensitivity to patient motion, which causes artifacts in the subtraction images. These artifacts can often reduce the diagnostic value of this technique. Automated, fast and accurate motion compensation is therefore required. To meet this requirement, we first examine a method explicitly designed to detect local motion in DSA. We then implement a motion compensation algorithm by means of block matching on modern graphics hardware. Both methods search for maximal local similarity by evaluating a histogram-based measure. In this context, we are the first to have mapped an optimizing search strategy onto graphics hardware while parallelizing block matching. Moreover, we provide an innovative method for creating histograms on graphics hardware with vertex texturing and frame buffer blending. It turns out that both methods can effectively correct the artifacts in most cases, and the hardware implementation of block matching performs much faster: the displacements of two 1024 x 1024 images can be calculated at 3 frames/s with integer precision or 2 frames/s with sub-pixel precision. Preliminary clinical evaluation indicates that computation with integer precision may already be sufficient.
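To make the search structure concrete, the following sketch does exhaustive block matching with one plausible histogram-based similarity, the energy of the grey-level histogram of the difference block, which concentrates (and thus peaks) when the blocks align. The paper's exact measure and its optimized GPU search strategy are not reproduced here, and all names, window sizes, and the measure itself are assumptions.

```python
import numpy as np

def hist_energy(diff, bins=64):
    """Energy of the grey-level histogram of a difference block: the
    better two blocks align, the more the histogram concentrates and
    the higher its energy (one plausible histogram-based measure)."""
    h, _ = np.histogram(diff, bins=bins)
    p = h / diff.size
    return np.sum(p ** 2)

def best_shift(mask_img, live_img, y, x, size=32, search=8):
    """Exhaustive block matching around (y, x): try every displacement in
    a +/-search window and keep the one maximizing the measure. Assumes
    the block lies at least `search` pixels inside the image border."""
    ref = mask_img[y:y + size, x:x + size].astype(float)
    best, arg = -1.0, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = live_img[y + dy:y + dy + size, x + dx:x + dx + size].astype(float)
            score = hist_energy(ref - cand)
            if score > best:
                best, arg = score, (dy, dx)
    return arg
```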
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chambon, Paul; Curran, Scott; Huff, Shean
Rapid vehicle and powertrain development has become essential to the design and implementation of vehicles that meet and exceed the fuel efficiency, cost, and performance targets expected by today's consumers, while keeping pace with reduced development cycles and more frequent product releases. Advances in large-scale additive manufacturing have recently provided the means to bridge hardware-in-the-loop (HIL) experimentation and preproduction mule chassis evaluation. Our paper details the accelerated development of a printed range-extended electric vehicle (REEV) by Oak Ridge National Laboratory, paralleling hardware-in-the-loop development of the powertrain with rapid chassis prototyping using big area additive manufacturing (BAAM). BAAM's ability to accelerate mule vehicle development from computer-aided design to vehicle build is explored. The use of a hardware-in-the-loop laboratory is described as it is applied to the design of a range-extended electric powertrain to be installed in a printed prototype vehicle. Furthermore, the integration of the powertrain and the opportunities and challenges it presents are described. A comparison of offline simulation, HIL, and chassis-rolls results is presented to validate the development process. Chassis dynamometer results for battery-electric and range-extender operation are analyzed to show the benefits of the architecture.
Chambon, Paul; Curran, Scott; Huff, Shean; ...
2017-01-29
Rapid vehicle and powertrain development has become essential to the design and implementation of vehicles that meet and exceed the fuel efficiency, cost, and performance targets expected by today's consumers, while keeping pace with reduced development cycles and more frequent product releases. Advances in large-scale additive manufacturing have recently provided the means to bridge hardware-in-the-loop (HIL) experimentation and preproduction mule chassis evaluation. Our paper details the accelerated development of a printed range-extended electric vehicle (REEV) by Oak Ridge National Laboratory, paralleling hardware-in-the-loop development of the powertrain with rapid chassis prototyping using big area additive manufacturing (BAAM). BAAM's ability to accelerate mule vehicle development from computer-aided design to vehicle build is explored. The use of a hardware-in-the-loop laboratory is described as it is applied to the design of a range-extended electric powertrain to be installed in a printed prototype vehicle. Furthermore, the integration of the powertrain and the opportunities and challenges it presents are described. A comparison of offline simulation, HIL, and chassis-rolls results is presented to validate the development process. Chassis dynamometer results for battery-electric and range-extender operation are analyzed to show the benefits of the architecture.
A Plug and Play GNC Architecture Using FPGA Components
NASA Technical Reports Server (NTRS)
KrishnaKumar, K.; Kaneshige, J.; Waterman, R.; Pires, C.; Ippoloito, C.
2005-01-01
The goal of Plug and Play, or PnP, is to allow hardware and software components to work together automatically, without requiring manual setup procedures. As a result, new or replacement hardware can be plugged into a system and automatically configured with the appropriate resource assignments. However, in many cases it may not be practical or even feasible to physically replace hardware components. One method for handling these types of situations is through the incorporation of reconfigurable hardware such as Field Programmable Gate Arrays, or FPGAs. This paper describes a phased approach to developing a Guidance, Navigation, and Control (GNC) architecture that expands on the traditional concepts of PnP, in order to accommodate hardware reconfiguration without requiring detailed knowledge of the hardware. This is achieved by establishing a functional based interface that defines how the hardware will operate, and allow the hardware to reconfigure itself. The resulting system combines the flexibility of manipulating software components with the speed and efficiency of hardware.
Deep learning with coherent nanophotonic circuits
NASA Astrophysics Data System (ADS)
Shen, Yichen; Harris, Nicholas C.; Skirlo, Scott; Prabhu, Mihika; Baehr-Jones, Tom; Hochberg, Michael; Sun, Xin; Zhao, Shijie; Larochelle, Hugo; Englund, Dirk; Soljačić, Marin
2017-07-01
Artificial neural networks are computational network models inspired by signal processing in the brain. These models have dramatically improved performance for many machine-learning tasks, including speech and image recognition. However, today's computing hardware is inefficient at implementing neural networks, in large part because much of it was designed for von Neumann computing schemes. Significant effort has been made towards developing electronic architectures tuned to implement artificial neural networks that exhibit improved computational speed and accuracy. Here, we propose a new architecture for a fully optical neural network that, in principle, could offer an enhancement in computational speed and power efficiency over state-of-the-art electronics for conventional inference tasks. We experimentally demonstrate the essential part of the concept using a programmable nanophotonic processor featuring a cascaded array of 56 programmable Mach-Zehnder interferometers in a silicon photonic integrated circuit and show its utility for vowel recognition.
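For readers unfamiliar with how interferometers carry out linear algebra: each Mach-Zehnder interferometer (MZI) realizes a programmable 2x2 unitary, and cascading MZIs over adjacent waveguide pairs composes larger unitaries, the matrix half of a neural-network layer. The sketch below uses one common convention for the MZI transfer matrix; sign and global-phase conventions vary between papers, and the mesh ordering shown is only illustrative of Reck/Clements-style decompositions.

```python
import numpy as np

def mzi(theta, phi):
    """2x2 transfer matrix of one MZI: two 50:50 couplers around an
    internal phase shift theta, with an external phase phi on one input
    arm (one common convention; conventions differ across papers)."""
    bs = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)   # 50:50 coupler
    inner = np.diag([np.exp(1j * theta), 1.0])
    outer = np.diag([np.exp(1j * phi), 1.0])
    return bs @ inner @ bs @ outer

def embed(U2, n, k):
    """Place a 2x2 MZI matrix on adjacent modes (k, k+1) of an n-mode mesh."""
    U = np.eye(n, dtype=complex)
    U[k:k + 2, k:k + 2] = U2
    return U

# A tiny 4-mode 'mesh': cascaded MZIs with arbitrary phase settings.
rng = np.random.default_rng(1)
mesh = np.eye(4, dtype=complex)
for k in (0, 1, 2, 1, 0):                 # illustrative ordering only
    mesh = embed(mzi(*rng.uniform(0, 2 * np.pi, 2)), 4, k) @ mesh
assert np.allclose(mesh.conj().T @ mesh, np.eye(4))  # product stays unitary
```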
The design of a fast Level 1 Track trigger for the ATLAS High Luminosity Upgrade
NASA Astrophysics Data System (ADS)
Miller Allbrooke, Benedict Marc; ATLAS Collaboration
2017-10-01
The ATLAS experiment at the high-luminosity LHC will face a five-fold increase in the number of interactions per collision relative to the ongoing Run 2. This will require a proportional improvement in rejection power at the earliest levels of the detector trigger system, while preserving good signal efficiency, due to the increase in the likelihood of individual trigger thresholds being passed as a result of pile-up related activity. One critical aspect of this improvement will be the implementation of precise track reconstruction, through which sharper turn-on curves, b-tagging and tau-tagging techniques can in principle be implemented. The challenge of such a project comes in the development of a fast, precise custom electronic device integrated in the hardware-based first trigger level of the experiment, with repercussions propagating as far as the detector read-out philosophy.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barbara Chapman
OpenMP was not well recognized at the beginning of the project, around 2003, because of its limited use in DoE production applications and the immature hardware support for an efficient implementation. Yet in recent years it has been gradually adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We observed this trend and worked diligently to improve our OpenMP compiler and runtimes, as well as with the OpenMP standard organization, to make sure OpenMP evolved in a direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future many-core) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support.
Using CLIPS in the domain of knowledge-based massively parallel programming
NASA Technical Reports Server (NTRS)
Dvorak, Jiri J.
1994-01-01
The Program Development Environment (PDE) is a tool for massively parallel programming of distributed-memory architectures. Adopting a knowledge-based approach, the PDE eliminates the complexity introduced by parallel hardware with distributed memory and offers complete transparency with respect to parallelism exploitation. The knowledge-based part of the PDE is realized in CLIPS. Its principal task is to find an efficient parallel realization of the application specified by the user in a comfortable, abstract, domain-oriented formalism. A large collection of fine-grain parallel algorithmic skeletons, represented as COOL objects in a tree hierarchy, contains the algorithmic knowledge. A hybrid knowledge base with rule modules and procedural parts, encoding expertise about the application domain, parallel programming, software engineering, and parallel hardware, enables a high degree of automation in the software development process. In this paper, important aspects of the implementation of the PDE using CLIPS and COOL are shown, including the embedding of CLIPS with the C++-based parts of the PDE. The appropriateness of the chosen approach and of the CLIPS language for knowledge-based software engineering is discussed.
Proteinortho: detection of (co-)orthologs in large-scale analysis.
Lechner, Marcus; Findeiss, Sven; Steiner, Lydia; Marz, Manja; Stadler, Peter F; Prohaska, Sonja J
2011-04-28
Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practice. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases. The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes. Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.
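The reciprocal best alignment idea at the heart of Proteinortho can be stated in a few lines: call two proteins ortholog candidates when each is the other's best-scoring hit. A minimal sketch under assumed input structures follows; Proteinortho's actual implementation adds score tolerances, the extended (co-)ortholog criterion, and a clustering step that are not shown.

```python
def reciprocal_best_hits(hits_ab, hits_ba):
    """Reciprocal best hit heuristic (sketch). hits_ab / hits_ba map a
    query id to a list of (subject_id, score) pairs, e.g. parsed from
    BLAST tabular output (assumed format). Returns (a, b) pairs where a
    in genome A and b in genome B are each other's best hit."""
    best_ab = {q: max(s, key=lambda t: t[1])[0] for q, s in hits_ab.items() if s}
    best_ba = {q: max(s, key=lambda t: t[1])[0] for q, s in hits_ba.items() if s}
    return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]
```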
ATCA digital controller hardware for vertical stabilization of plasmas in tokamaks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Batista, A. J. N.; Sousa, J.; Varandas, C. A. F.
2006-10-15
The efficient vertical stabilization (VS) of plasmas in tokamaks requires a fast reaction of the VS controller, for example, after detection of edge localized modes (ELMs). For controlling the effects of very large ELMs, new digital control hardware based on the Advanced Telecommunications Computing Architecture (ATCA™) is being developed, aiming to reduce the VS digital control loop cycle (down to an optimal value of 10 μs) and improve the algorithm performance. The system has one ATCA™ processor module and up to 12 ATCA™ control modules, each one with 32 analog input channels (12-bit resolution), 4 analog output channels (12-bit resolution), and 8 digital input/output channels. The Aurora™ and PCI Express™ communication protocols will be used for data transport between modules, with expected latencies below 2 μs. Control algorithms are implemented on an ix86-based processor with 6 Gflops and on field programmable gate arrays with 80 GMACS, interconnected by serial gigabit links in a full mesh topology.
Definition of an auxiliary processor dedicated to real-time operating system kernels
NASA Technical Reports Server (NTRS)
Halang, Wolfgang A.
1988-01-01
In order to increase the efficiency of process control data processing, it is necessary to enhance the productivity of real-time high-level languages and to automate task administration, because presently 60 percent or more of applications are still programmed in assembly languages. This may be achieved by migrating apt functions for the support of process-control-oriented languages into the hardware, i.e., by new architectures. Whereas numerous high-level languages have already been defined or realized, there are as yet no investigations of hardware-assisted implementation of real-time features. The requirements to be fulfilled by languages and operating systems in hard real-time environments are summarized. A comparison of the most prominent languages, viz. Ada, HAL/S, LTR, Pearl, as well as the real-time extensions of FORTRAN and PL/1, reveals how existing languages meet these demands and which features still need to be incorporated to enable the development of reliable software with predictable program behavior, thus making it possible to carry out a technical safety approval. Accordingly, Pearl proved to be the closest match to the mentioned requirements.
Digital Hardware Design Teaching: An Alternative Approach
ERIC Educational Resources Information Center
Benkrid, Khaled; Clayton, Thomas
2012-01-01
This article presents the design and implementation of a complete review of undergraduate digital hardware design teaching in the School of Engineering at the University of Edinburgh. Four guiding principles have been used in this exercise: learning-outcome driven teaching, deep learning, affordability, and flexibility. This has identified…
Using MaxCompiler for the high level synthesis of trigger algorithms
NASA Astrophysics Data System (ADS)
Summers, S.; Rose, A.; Sanders, P.
2017-02-01
Firmware for FPGA trigger applications at the CMS experiment is conventionally written using hardware description languages such as Verilog and VHDL. MaxCompiler is an alternative, Java based, tool for developing FPGA applications which uses a higher level of abstraction from the hardware than a hardware description language. An implementation of the jet and energy sum algorithms for the CMS Level-1 calorimeter trigger has been written using MaxCompiler to benchmark against the VHDL implementation in terms of accuracy, latency, resource usage, and code size. A Kalman Filter track fitting algorithm has been developed using MaxCompiler for a proposed CMS Level-1 track trigger for the High-Luminosity LHC upgrade. The design achieves a low resource usage, and has a latency of 187.5 ns per iteration.
Parallelizing alternating direction implicit solver on GPUs
USDA-ARS?s Scientific Manuscript database
We present a parallel Alternating Direction Implicit (ADI) solver on GPUs. Our implementation significantly improves existing implementations in two aspects. First, we address the scalability issue of existing Parallel Cyclic Reduction (PCR) implementations by eliminating their hardware resource con...
NASA Astrophysics Data System (ADS)
Deng, Zhiwei; Li, Xicai; Shi, Junsheng; Huang, Xiaoqiao; Li, Feiyan
2018-01-01
Depth measurement is the most basic measurement in various machine vision applications, such as automatic driving, unmanned aerial vehicles (UAVs), and robots, and it has a wide range of uses. With the development of image processing technology and improvements in hardware miniaturization and processing speed, real-time depth measurement using dual cameras has become a reality. In this paper, an embedded AM5728 and an ordinary low-cost dual camera are used as the hardware platform. The related algorithms for dual-camera calibration, image matching and depth calculation have been studied and implemented on the hardware platform, and the hardware design and the rationality of the related algorithms of the system were tested. The experimental results show that the system can realize simultaneous acquisition of binocular images, switching of left and right video sources, and display of the depth image and depth range. For images with a resolution of 640 × 480, the processing speed of the system can reach 25 fps. The experimental results show that the optimal measurement range of the system is from 0.5 to 1.5 meters, and the relative error of the distance measurement is less than 5%. Compared with PC, ARM11 and DMCU hardware platforms, the embedded AM5728 hardware is well suited to meeting real-time depth measurement requirements while maintaining image resolution.
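The depth computation that follows stereo matching is plain triangulation. A minimal sketch under the usual rectified-pinhole assumptions follows (names are illustrative); the quadratic growth of depth error with distance under this relation is consistent with the reported 0.5-1.5 meter optimal range.

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m):
    """Rectified-stereo triangulation: Z = f * B / d, with disparity d
    in pixels, focal length f in pixels, and baseline B in metres."""
    d = np.asarray(disparity, dtype=float)
    with np.errstate(divide="ignore"):
        z = focal_px * baseline_m / d
    z[~np.isfinite(z)] = 0.0    # zero disparity -> no valid depth
    return z
```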
Face classification using electronic synapses
NASA Astrophysics Data System (ADS)
Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H.-S. Philip; Qian, He
2017-05-01
Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow cognitive tasks to be completed more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry-friendly materials. The device shows bidirectional continuous weight-modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using an Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of the analogue synaptic array and pave the way toward building an energy-efficient and large-scale neuromorphic system.
Engineering the quantum states of light in a Kerr-nonlinear resonator by two-photon driving
NASA Astrophysics Data System (ADS)
Puri, Shruti; Boutin, Samuel; Blais, Alexandre
2017-04-01
Photonic cat states stored in high-Q resonators show great promise for hardware-efficient universal quantum computing. We propose an approach to efficiently prepare such cat states in a Kerr-nonlinear resonator by the use of a two-photon drive. Significantly, we show that this preparation is robust against single-photon loss. An outcome of this observation is that a two-photon drive can eliminate undesirable phase evolution induced by a Kerr nonlinearity. By exploiting the concept of transitionless quantum driving, we moreover demonstrate how non-adiabatic initialization of cat states is possible. Finally, we present a universal set of quantum logical gates that can be performed on the engineered eigenspace of such a two-photon driven resonator and discuss a possible realization using superconducting circuits. The robustness of the engineered subspace to higher-order circuit nonlinearities makes this implementation favorable for scalable quantum computation.
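The mechanism can be stated compactly in the standard form used in the literature for a two-photon-driven Kerr resonator (a sketch consistent with the abstract; the notation here, K for the Kerr coefficient and ε₂ for the two-photon drive amplitude, is an assumption rather than a quote of the paper):

```latex
\begin{aligned}
H &= -K\,\hat{a}^{\dagger 2}\hat{a}^{2}
     + \epsilon_{2}\bigl(\hat{a}^{\dagger 2} + \hat{a}^{2}\bigr) \\
  &= -K\Bigl(\hat{a}^{\dagger 2} - \tfrac{\epsilon_{2}}{K}\Bigr)
       \Bigl(\hat{a}^{2} - \tfrac{\epsilon_{2}}{K}\Bigr)
     + \tfrac{\epsilon_{2}^{2}}{K}.
\end{aligned}
```

Since \(\hat{a}^{2}|{\pm\alpha}\rangle = \alpha^{2}|{\pm\alpha}\rangle\), the coherent states \(|{\pm\alpha}\rangle\) with \(\alpha = \sqrt{\epsilon_{2}/K}\) are exactly degenerate eigenstates of this Hamiltonian, and their even and odd superpositions, the cat states, span the engineered eigenspace in which the logical gates are defined.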
Exploring Manycore Multinode Systems for Irregular Applications with FPGA Prototyping
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ceriani, Marco; Palermo, Gianluca; Secchi, Simone
We present a prototype of a multi-core architecture implemented on FPGA, designed to enable efficient execution of irregular applications on distributed shared memory machines while maintaining high performance on regular workloads. The architecture is composed of off-the-shelf soft cores, local interconnection and a memory interface, integrated with custom components that optimize it for irregular applications. It relies on three key elements: a global address space, multithreading, and fine-grained synchronization. Global addresses are scrambled to reduce the formation of network hot-spots, while the latency of transactions is covered by integrating a hardware scheduler within the custom load/store buffers to take advantage of the availability of multiple execution threads, increasing efficiency in a way that is transparent to the application. We evaluated a dual-node system on irregular kernels, showing scalability in the number of cores and threads.
3D Reconfigurable MPSoC for Unmanned Spacecraft Navigation
NASA Astrophysics Data System (ADS)
Dekoulis, George
2016-07-01
This paper describes the design of a new lightweight spacecraft navigation system for unmanned space missions. The system addresses the demands for more efficient autonomous navigation in the near-Earth environment or deep space. The proposed instrumentation is directly suitable for unmanned systems operation and testing of new airborne prototypes for remote sensing applications. The system features a new sensor technology and significant improvements over existing solutions. Fluxgate type sensors have been traditionally used in unmanned defense systems such as target drones, guided missiles, rockets and satellites, however, the guidance sensors' configurations exhibit lower specifications than the presented solution. The current implementation is based on a recently developed material in a reengineered optimum sensor configuration for unprecedented low-power consumption. The new sensor's performance characteristics qualify it for spacecraft navigation applications. A major advantage of the system is the efficiency in redundancy reduction achieved in terms of both hardware and software requirements.
Face classification using electronic synapses.
Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H-S Philip; Qian, He
2017-05-12
Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow cognitive tasks to be completed more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry-friendly materials. The device shows bidirectional continuous weight-modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using an Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of the analogue synaptic array and pave the way toward building an energy-efficient and large-scale neuromorphic system.
Theoretical and software considerations for nonlinear dynamic analysis
NASA Technical Reports Server (NTRS)
Schmidt, R. J.; Dodds, R. H., Jr.
1983-01-01
In the finite element method for structural analysis, it is generally necessary to discretize the structural model into a very large number of elements to accurately evaluate displacements, strains, and stresses. As the complexity of the model increases, the number of degrees of freedom can easily exceed the capacity of present-day software systems. Improvements to structural analysis software, including more efficient use of existing hardware and improved structural modeling techniques, are discussed. One modeling technique that is used successfully in static linear and nonlinear analysis is multilevel substructuring. This research extends the use of multilevel substructure modeling to include dynamic analysis and defines the requirements for a general-purpose software system capable of efficient nonlinear dynamic analysis. The multilevel substructuring technique is presented, the analytical formulations and computational procedures for dynamic analysis and nonlinear mechanics are reviewed, and an approach to the design and implementation of a general-purpose structural software system is presented.
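The substructuring step being extended here reduces, at each level, a substructure to its boundary degrees of freedom. A minimal NumPy sketch of single-level static (Guyan) condensation follows, assuming integer index arrays for interior and boundary DOFs; the dynamic-analysis extension discussed above would enrich this with additional generalized coordinates (e.g., component modes), which is not shown.

```python
import numpy as np

def condense(K, M, interior, boundary):
    """Static (Guyan) condensation of one substructure: eliminate the
    interior DOFs, leaving reduced stiffness and mass matrices on the
    boundary DOFs through which the substructure couples to the next
    level. interior/boundary: integer index arrays into K and M."""
    Kii = K[np.ix_(interior, interior)]
    Kib = K[np.ix_(interior, boundary)]
    # transformation u = T u_b, with interior DOFs slaved to the boundary
    T = np.vstack([-np.linalg.solve(Kii, Kib), np.eye(len(boundary))])
    order = np.concatenate([interior, boundary])
    Kf = K[np.ix_(order, order)]
    Mf = M[np.ix_(order, order)]
    # reduced stiffness is the Schur complement; reduced mass is approximate
    return T.T @ Kf @ T, T.T @ Mf @ T
```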